Decades of data on the global burden of disease (measured in disability-adjusted life years, DALYs) were cleaned, interpreted and visualised. A linear regression model was then fitted to predict the future burden of disease (with a reported prediction accuracy of 85.7%), adjustable for changes in demographics, health systems, diet, education, and other factors.
This presentation was created as a group project during the Business Analytics course at London Business School.
The Burden of Disease: Data analysis, interpretation and linear regression
1. The Burden of Disease
Group 30
Aman Desai, Jim Huang, Gloria Marín, Carmen Chen, Dimitris Charitos, Lorenzo Gherardi
Index
• Introduction to the case
• Methodology
• Data Visualization
• Initial Regressions
• Moving Forward
2. The Burden of Disease
Index
• Introduction to the case
• Main variables
• Exploratory Analysis
• Initial Regressions
• Moving Forward
DALYs: Disability-Adjusted Life Years
• Metric used to measure the Burden of Disease
• DALY includes the sum of mortality and morbidity due to a specific disease
• One DALY = loss of 1 year in good health because of
• Premature death
• Disease
• Disability
- Mortality is used as a method to assess a population's health
- Through ‘child mortality’
- Through ‘life expectancy’
- The problem with this method is that it does not account for people who live with suffering caused by a disease that prevents a normal life.
- For people to get healthy, attention needs to be given to the impact on the lives of people suffering from disease. Years of contribution to one’s community, industry, and nation are lost.
ASSIGNMENT PURPOSE:
1. Understand causes
2. Identify factors that magnify its impact
Introduction to the Case
3. The Burden of Disease
Introduction to the Case: background and preparation
Communicable diseases
• Diarrhoea, lower respiratory & other common infectious diseases
• Neonatal disorders
• Maternal disorders
• Malaria & neglected tropical diseases
• Nutritional deficiencies
• HIV/AIDS
• Tuberculosis
• Other communicable diseases
Non-communicable diseases (NCDs)
• Cardiovascular diseases (inc. stroke, heart disease and heart failure)
• Cancers
• Respiratory disease
• Diabetes, blood and endocrine diseases
• Mental and substance use disorders
• Liver diseases
• Digestive diseases
• Musculoskeletal disorders
• Neurological disorders (including dementia)
• Other NCDs
Injuries
• Road injuries
• Other transport injuries
• Falls
• Drowning
• Fire, heat and hot substances
• Poisonings
• Self-harm
• Interpersonal violence
• Conflict & terrorism
• Natural disasters
4. Methodology: Linear Model and Data Visualization
Step 1. Identify relevant variables
• Explanatory variables (x): Selected from hundreds of candidate factors to cover the following dimensions: diet (e.g., fruit consumption), healthcare level (e.g., healthcare expenditure), lifestyle (e.g., smoking %), and other demographics (e.g., education, overweight %)
• Response variables (y): Chose ‘Overall DALYs’, ‘Communicable Diseases DALYs’, ‘Non-Communicable Diseases DALYs’, and ‘Injuries DALYs’ from 24 possible variables by comparing the models
Step 2. Check for non-linear relations
Step 3. Generate the linear regression and prediction model
• Dropped all statistically insignificant variables
• Checked the VIF to limit the risk of multicollinearity
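The VIF check in Step 3 can be sketched as follows. This is a minimal illustration on R's built-in mtcars data, not the project's dataset; the predictor names are placeholders, and the threshold of 10 is the one quoted in the slides. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others.

```r
# Hypothetical illustration on the built-in mtcars data (not the project's data).
predictors <- c("wt", "hp", "disp")

# VIF for predictor j = 1 / (1 - R^2_j), with R^2_j taken from the regression
# of predictor j on all remaining predictors.
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))

# Predictors above the threshold of 10 would be candidates for removal.
high_vif <- names(vif_manual)[vif_manual > 10]
print(high_vif)
```

Computing the VIF by hand keeps the sketch free of extra packages; a dedicated helper from a regression package could be used instead.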
Data Visualization
• Bar Chart and Stacked Bar Chart: Compare causes of DALYs by continent
• Area Chart: Look into the DALY rate over 2000~2017 by continent
• Scatter Plot: Measure DALY due to Proportion of GDP spent on Healthcare by continent
Linear Model
5. Accumulated DALYs per Capita 1980 - 2017
Due to Communicable Diseases
Due to Non-Communicable Diseases
Due to Injuries
• Overall, Africa has the highest accumulated average DALY rate (9), followed by Asia (5), and Oceania (4).
• Africa's high rate is mainly due to communicable diseases, with a rate three times that of the next highest continent, Asia.
• DALY rates for non-communicable diseases and injuries are relatively uniform around the world.
• Africa's communicable DALY rate has declined since 2008. Despite this, the burden of disease on the continent remains high, which leaves room to consider its causes and potential solutions.
Summary
Communicable DALYs over 1980 - 2017
Results & Conclusion: Data Visualization (1)
6. • The variables affecting DALY were broken down further. The largest contributing factors found were:
- Cardiovascular Diseases for Non-Communicable Diseases, and
- Unintentional Injuries for Injuries.
• No single cause of disease was found to be significant across the different continents.
• GDP is negatively correlated with DALY for Communicable Diseases; however, the proportion of GDP spent on healthcare is positively correlated with it. This could be because poorer countries are more likely to have to combat communicable diseases and, as a result, spend more of their GDP on healthcare.
Summary
Results & Conclusion: Data Visualization (2)
Causes of DALYs by Continent due to Non-Communicable Disease
Causes of DALYs by Continent due to Injuries
Healthcare Expense vs. DALY from Non-Communicable Disease
Healthcare Expense vs. DALY from Communicable Disease
Analysis of the Correlation among Variables
7. 1. All explanatory variables’ Pr(>|t|) < 0.01
Results & Conclusion: Final Model
DALY = 66523 - 202.37 overweight% - 33.86 veg_consump - 1030.84 animal_protein_consump - 534.61 education - 8.67 pocket_per_cap - 40.86 fruit_consump - 7140.44 Asia + 13792.58 Africa - 9335.46 NorthA - 5196.99 Europe - 9146.72 SouthA
Model of best fit
2. VIF (Variance Inflation Factor) is <10
• After ruling out all insignificant variables, the best model had 7 variables.
• The risk of multicollinearity was checked by ensuring that VIF < 10.
• The R-squared obtained (75.35%) indicates that the model explains about three quarters of the variance in DALY.
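The model equation lists five continent terms and no Oceania term, which is consistent with continent entering the regression as a factor with Oceania as the reference level, although the slides do not state this. The sketch below shows how R's lm() turns a factor into dummy terms relative to a reference level. The data and coefficients are synthetic and purely illustrative.

```r
set.seed(1)

# Synthetic data (hypothetical): three continents, one numeric predictor.
toy <- data.frame(
  continent = factor(rep(c("Africa", "Asia", "Oceania"), each = 20)),
  education = runif(60, 2, 12)
)
toy$daly <- 40000 - 500 * toy$education +
  ifelse(toy$continent == "Africa", 14000, 0) +
  rnorm(60, sd = 1000)

# Make Oceania the reference level: the other continent coefficients are then
# offsets relative to Oceania, and Oceania itself gets no term.
toy$continent <- relevel(toy$continent, ref = "Oceania")

fit <- lm(daly ~ education + continent, data = toy)
coef_names <- names(coef(fit))
print(coef_names)
```

Read this way, a positive Africa coefficient means a higher predicted DALY than the reference continent when the other variables are held fixed.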
Summary
8. Results & Conclusion: Prediction
Step 1. Using our linear model, we estimated the DALY rates worldwide for 2013 from our data for all years up to 2012.
Step 2. The data was filtered to all periods before 2013 and a linear model was fitted.
Step 3. Using the linear model, data for 2013 was predicted.
Step 4. The predictions were compared with the actual data available for 2013 to determine the accuracy.
Prediction accuracy was 85.7%
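The hold-out procedure in Steps 2 to 4 can be sketched as below. The slides do not say how accuracy was calculated, so the metric here (1 minus the mean absolute percentage error) is an assumption, and the data is synthetic rather than the project's.

```r
set.seed(42)

# Synthetic panel (hypothetical): 30 locations observed from 2000 to 2013.
toy <- expand.grid(location = 1:30, period = 2000:2013)
toy$education <- runif(nrow(toy), 2, 12)
toy$daly <- 60000 - 2000 * toy$education + rnorm(nrow(toy), sd = 3000)

train   <- subset(toy, period < 2013)   # all periods before 2013
holdout <- subset(toy, period == 2013)  # the year to be predicted

fit  <- lm(daly ~ education, data = train)
pred <- predict(fit, newdata = holdout)

# Assumed accuracy metric: 1 - mean absolute percentage error (MAPE).
mape     <- mean(abs(pred - holdout$daly) / abs(holdout$daly))
accuracy <- 1 - mape
print(round(accuracy, 3))
```

Splitting by year rather than at random keeps any information from 2013 out of the fitted model, which matches the forecasting use case described on the slide.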
Prediction
9. Moving Forward: Adding New Variables
What other ‘external’ elements may be magnifying results?
COMMON TO ALL
• Percentage of population insured with health insurance.
• Number of medical doctors per 1,000 people.
• Number of nurses per 1,000 people.
• Out-of-pocket expenditure for healthcare.
SPECIFIC TO
a) Communicable, maternal, neonatal, and nutritional diseases
• Nutritional deficiencies.
• Hygiene practices.
• Housing space per person.
b) Non-communicable diseases
• Physical inactivity.
• Wellbeing.
• Genetics.
c) Injuries
• Surveillance.
• Regulations for safety.
10. The Burden of Disease
Group 30 - Aman Desai, Jim Huang, Gloria Marín, Carmen Chen, Dimitris Charitos, Lorenzo Gherardi
Introduction
A glance of DALY
Linear Regressions
DALY = 66523 - 202.37 overweight% - 33.86 veg_consum - 1030.84 animal_consum - 534.61 education - 8.67 pocket/cap - 40.86 fruit_consum - 7140.44 Asia + 13792.58 Africa - 9335.46 NorthA - 5196.99 Europe - 9146.72 SouthA
All explanatory variables’ Pr(>|t|) < 0.01 VIF (Variance Inflation Factor) is <10
Conclusions
Moving Forward
Methodology (Model)
Model of best fit
DALYs: Disability-Adjusted Life Years
• Metric used to measure the Burden of Disease
• It includes the sum of mortality and morbidity
• One DALY = loss of 1 year in good health because of premature death, disease, or disability
Aim of study
• Understand causes
• Identify factors that magnify the impact
Background & Preparation
Burden of Disease, 2017
Disease Burden due to Communicable disease vs GDP per capita
Category of Disease
• Communicable disease
• Non-Communicable
disease (e.g., Cancers)
• Injuries (e.g., Falls, Fire)
DALY Around the World Due to Communicable Disease
Due to Non-Communicable Disease
Step 1. Identify relevant variables
• Explanatory variables (x): Selected from hundreds of candidate factors to cover the following dimensions: diet (e.g., fruit consumption), healthcare level (e.g., healthcare expenditure), lifestyle (e.g., smoking %), and other demographics (e.g., education, overweight %)
• Response variables (y): Chose ‘Overall DALYs’, ‘Communicable Diseases DALYs’, ‘Non-Communicable Diseases DALYs’, and ‘Injuries DALYs’ from 24 possible variables by comparing the models
Step 2. Check for non-linear relations
Step 3. Generate the regression model
• Dropped all statistically insignificant variables
• Checked the VIF to limit the risk of multicollinearity
Statistical Technique
Healthcare Expense vs. DALY from Non-Communicable and Communicable Disease
Stacked Bar Chart - Causes of DALYs by continent for Non-Communicable Disease and Injuries
What other ‘external’ elements may be magnifying results?
• Common to all:
• Percentage of population insured with health insurance
• Number of medical doctors/Nurse per 1,000 people
• Specific to
a) Communicable diseases : Family size
b) Non-communicable diseases: Literacy rate
c) Injuries: Alcohol consumption
Linear Regression
• The best model had 7 variables (overweight%, veg_consum, animal_consum, education, pocket/cap, fruit_consum, continent), all with Pr(>|t|) < 0.01 and VIF < 10
• The R-squared (75.35%) suggests the model explains about three quarters of the variance of DALY
Data Visualization
• Africa has the highest avg. DALY rate (c. 9), followed by Asia (c. 5) and Oceania.
• Africa's high rate is mainly due to communicable diseases, with a rate more than triple that of the second-highest continent.
• Africa's communicable DALY rate has declined since 2008 but remains high relative to other continents, leaving room to further consider the causes and potential solutions
Communicable DALYs over 1980 - 2017
Analysis of Correlation among
Variables
11. 22/02/2021 CM30_GroupProject_SG30
file:///Users/Aman/Downloads/The Burden of Disease Code.html 1/39
CM30_GroupProject_SG30
Team 30
2021-02-14
1 Burden of Disease
Mortality rates are a common method used to assess a population’s health. Often used rates for such assessment
include child mortality or life expectancy. However, a focus on mortality neglects the suffering caused to people who
still live with the disease. A disease impacts, in a direct or indirect manner, the ability of living a normal life. Potential
contributions to one’s community, work, or nation, are often lost.
Our study, therefore, seeks to understand the magnitude of the burden of diseases by the different disease types, as
well as identify factors that amplify such effects.
The metric that will be used to measure disease burden is called DALY, which stands for Disability Adjusted Life Years.
This metric includes the sum of mortality and morbidity. One DALY stands for 1 year loss in good health due to either
premature death, disease, or disability.
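As a rough illustration of how mortality and morbidity add up in this metric, the sketch below uses the standard decomposition DALY = YLL + YLD (years of life lost plus years lived with disability). All figures are made up for the example and are not taken from the study's data; the YLD formula is a simplified form.

```r
# Hypothetical figures for one population and one disease (not from the study).
deaths               <- 120   # deaths caused by the disease
years_lost_per_death <- 25    # average remaining life expectancy at death
cases                <- 2000  # people living with the disease
disability_weight    <- 0.2   # severity from 0 (full health) to 1 (death)
years_with_disease   <- 1     # duration per case, in years

yll  <- deaths * years_lost_per_death                   # years of life lost
yld  <- cases * disability_weight * years_with_disease  # years lived with disability
daly <- yll + yld

print(c(yll = yll, yld = yld, daly = daly))
```

A mortality-only measure would capture the first term and miss the second, which is the gap the DALY metric is meant to close.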
1.1 Data import and inspection
1.1.0.1 Importing data for overall disease burden (DALY)
Rows: 48,698
Columns: 7
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ total_population_gapminder_hyde_un <dbl> …
$ continent <chr> …
$ health_expenditure_per_capita_current_us <dbl> …
$ dal_ys_disability_adjusted_life_years_all_causes_sex_both_age_age_standardized_rate <dbl> …
#source: https://ourworldindata.org/burden-of-disease
# Reading first file
daly_total <- read_csv(here::here('Data',"disease-burden-vs-health-expenditure-per-capita.csv")) %>%
clean_names()
# Checking for variable types
glimpse(daly_total)
# Changing variable names and variable types
daly_total<- daly_total %>%
mutate(
location=as.factor(entity),
period=year,
health_expenditure_per_capita=health_expenditure_per_capita_current_us,
daly_adjusted=dal_ys_disability_adjusted_life_years_all_causes_sex_both_age_age_standardized_rate,
total_population = total_population_gapminder_hyde_un) %>%
select(location,period,daly_adjusted,health_expenditure_per_capita,total_population)
Although important as a whole, DALY rates can further be divided into 3 sub-categories of disease cause: communicable diseases, non-communicable diseases, and injuries. We therefore included the datasets for each individual sub-category below.
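Each sub-category file is joined to the main table on location and period using base R's merge(). By default merge() performs an inner join, so rows without a match in both tables are dropped. The toy data frames below are hypothetical and only illustrate that behaviour, together with the all.x = TRUE alternative that keeps unmatched rows.

```r
# Hypothetical toy tables standing in for the main and sub-category datasets.
daly_toy <- data.frame(
  location      = c("A", "A", "B"),
  period        = c(2000, 2001, 2000),
  daly_adjusted = c(50000, 49000, 30000)
)
ncds_toy <- data.frame(
  location  = c("A", "B"),
  period    = c(2000, 2000),
  daly_ncds = c(20000, 18000)
)

# Default merge(): inner join, keeps only location-period pairs in both tables.
inner <- merge(daly_toy, ncds_toy, by = c("location", "period"))

# all.x = TRUE: left join, keeps every row of the first table and fills NA.
left <- merge(daly_toy, ncds_toy, by = c("location", "period"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

With many sources merged in sequence, an inner join shrinks the table to the location-period pairs covered by every source, so sparse sources can remove a large share of the rows.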
1.1.0.2 Adding data for burden of non-communicable diseases
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_non_communicable_diseases_sex_both_age_age_standardized_rate <dbl> …
Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,…
$ daly_ncds <dbl> 41145.51, 40587.17, 39644.60, 39821.31, 40641.76, 40790.73,…
1.1.0.3 Adding data for burden from communicable, neonatal, maternal and nutritional diseases
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_age_standardized_rate <dbl> …
#source:https://ourworldindata.org/burden-of-disease
#Reading the file
ncds <- read_csv(here::here('Data',"burden-of-disease-rates-from-ncds.csv")) %>%
clean_names()
# Checking for variable types
glimpse(ncds)
# Changing variable names and variable types
ncds<- ncds %>%
mutate(location=as.factor(entity),
period=year,
daly_ncds=dal_ys_disability_adjusted_life_years_non_communicable_diseases_sex_both_age_age_standardized_rate) %>%
select(location,period,daly_ncds)
glimpse(ncds)
#Merging data frames
total <- merge(daly_total,ncds,by=c("location","period"))
#source:https://ourworldindata.org/burden-of-disease
#Reading the file
cnmnd <- read_csv(here::here('Data',"burden-of-disease-rates-from-communicable-neonatal-maternal-nutritional-diseases.csv")) %>%
clean_names()
# Checking for variable types
glimpse(cnmnd)
Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghan…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999…
$ daly_cnmnd <dbl> 51181.84, 47263.29, 38908.25, 36882.69, 38809.79, 38262.20…
1.1.0.4 Adding data for burden from injuries, violence, self-harm and accidents
Rows: 6,468
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ dal_ys_disability_adjusted_life_years_injuries_sex_both_age_age_standardized_rate <dbl> …
Rows: 6,468
Columns: 3
$ location <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
$ period <dbl> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,…
$ daly_ivsa <dbl> 11775.715, 13390.289, 12365.622, 11530.363, 13546.148, 1238…
Within each of the 3 sub-categories of disease causes, there are specific diseases. We included all of these categories in our dataset.
1.1.0.5 Adding data for disease burden by cause (DALY by cause)
# Changing variable names and variable types
cnmnd<- cnmnd %>%
mutate(location=as.factor(entity),
period=year,
daly_cnmnd=dal_ys_disability_adjusted_life_years_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_age_standardized_rate) %>%
select(location,period,daly_cnmnd)
glimpse(cnmnd)
#Merging data frames
total <- merge(total,cnmnd,by=c("location","period"))
#source:https://ourworldindata.org/burden-of-disease
#Reading the file
ivsa <- read_csv(here::here('Data',"burden-of-disease-rates-from-injuries.csv")) %>%
clean_names()
# Checking for variable types
glimpse(ivsa)
# Changing variable names and variable types
ivsa<- ivsa %>%
mutate(location=as.factor(entity),
period=year,
daly_ivsa=dal_ys_disability_adjusted_life_years_injuries_sex_both_age_age_standardized_rate) %>%
select(location,period,daly_ivsa)
glimpse(ivsa)
#Merging data frames
total <- merge(total,ivsa,by=c("location","period"))
Aside from the main variables, additional variables that may contribute to the final DALY rates were included in the dataset.
1.1.0.6 Adding data for GDP per capita
#source: https://ourworldindata.org/burden-of-disease
# Reading second file
daly_by_cause <- read_csv(here::here('Data',"burden-of-disease-by-cause.csv")) %>%
clean_names()
# Checking for variable types
#glimpse(daly_by_cause)
# Changing variable names and variable types
daly_by_cause <- daly_by_cause %>%
mutate(
location=as.factor(entity),
period=year,
daly_conflict_terrorism=dal_ys_disability_adjusted_life_years_conflict_and_terrorism_sex_both_age_all_ages_number,
daly_hiv_tuberculosis=dal_ys_disability_adjusted_life_years_hiv_aids_and_tuberculosis_sex_both_age_all_ages_number,
daly_diahrrea_respiratory=dal_ys_disability_adjusted_life_years_diarrhea_lower_respiratory_and_other_common_infectious_diseases_sex_both_age_all_ages_number,
daly_cvs=dal_ys_disability_adjusted_life_years_cardiovascular_diseases_sex_both_age_all_ages_number,
daly_self_harm=dal_ys_disability_adjusted_life_years_self_harm_sex_both_age_all_ages_number,
daly_violence=dal_ys_disability_adjusted_life_years_interpersonal_violence_sex_both_age_all_ages_number,
daly_nutritional_deficiencies=dal_ys_disability_adjusted_life_years_nutritional_deficiencies_sex_both_age_all_ages_number,
daly_transport_injuries=dal_ys_disability_adjusted_life_years_transport_injuries_sex_both_age_all_ages_number,
daly_unintentional_injuries=dal_ys_disability_adjusted_life_years_unintentional_injuries_sex_both_age_all_ages_number,
daly_maternal_disorders=dal_ys_disability_adjusted_life_years_maternal_disorders_sex_both_age_all_ages_number,
daly_neonatal_disorders=dal_ys_disability_adjusted_life_years_neonatal_disorders_sex_both_age_all_ages_number,
daly_other_communicable=dal_ys_disability_adjusted_life_years_other_communicable_maternal_neonatal_and_nutritional_diseases_sex_both_age_all_ages_number,
daly_nature_forces=dal_ys_disability_adjusted_life_years_exposure_to_forces_of_nature_sex_both_age_all_ages_number,
daly_chronic_respiratory=dal_ys_disability_adjusted_life_years_chronic_respiratory_diseases_sex_both_age_all_ages_number,
daly_chronic_liver=dal_ys_disability_adjusted_life_years_cirrhosis_and_other_chronic_liver_diseases_sex_both_age_all_ages_number,
daly_digestive=dal_ys_disability_adjusted_life_years_digestive_diseases_sex_both_age_all_ages_number,
daly_tropical_and_malaria=dal_ys_disability_adjusted_life_years_neglected_tropical_diseases_and_malaria_sex_both_age_all_ages_number,
daly_musculoskeletal=dal_ys_disability_adjusted_life_years_musculoskeletal_disorders_sex_both_age_all_ages_number,
daly_other_non_communicable=dal_ys_disability_adjusted_life_years_other_non_communicable_diseases_sex_both_age_all_ages_number,
daly_neurological=dal_ys_disability_adjusted_life_years_neurological_disorders_sex_both_age_all_ages_number,
daly_mental_and_substance=dal_ys_disability_adjusted_life_years_mental_and_substance_use_disorders_sex_both_age_all_ages_number,
daly_diabetes_urogenital_blood_endocrine=dal_ys_disability_adjusted_life_years_diabetes_urogenital_blood_and_endocrine_diseases_sex_both_age_all_ages_number,
daly_neoplasms=dal_ys_disability_adjusted_life_years_neoplasms_sex_both_age_all_ages_number)%>%
select(location, period,daly_conflict_terrorism,daly_hiv_tuberculosis,daly_diahrrea_respiratory,daly_cvs,daly_self_harm,daly_violence,daly_nutritional_deficiencies,daly_transport_injuries,daly_unintentional_injuries,daly_maternal_disorders,daly_neonatal_disorders,daly_other_communicable,daly_nature_forces,daly_chronic_respiratory,daly_chronic_liver,daly_digestive,daly_tropical_and_malaria,daly_musculoskeletal,daly_other_non_communicable,daly_neurological,daly_mental_and_substance,daly_diabetes_urogenital_blood_endocrine,daly_neoplasms)
#glimpse(daly_by_cause)
# Merging dataframes
total <- merge(total,daly_by_cause,by=c("location","period"))
#We will consider taking out health expenditure per capita since it has a complete rate of 57.4% and may distort the final data.
#source: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
# Reading third file
gdp <- read_csv(here::here('Data',"API_NY.GDP.PCAP.CD_DS2_en_csv_v2_1926744.csv"),skip=3) %>%
clean_names()
# Checking for variable types
glimpse(gdp)
1.1.0.7 Adding data for smoking percentages
1.1.0.8 Adding data for healthcare expenditure per capita
Rows: 4,675
Columns: 4
$ entity <chr> "Afghan…
$ code <chr> "AFG", …
$ year <dbl> 2002, 2…
$ health_expenditure_per_capita_ppp_constant_2011_international <dbl> 75.9835…
# Changing variable names and variable types
gdp <- gdp %>%
gather(year, gdp,-c(country_name, country_code,indicator_name,indicator_code)) %>%
mutate(location=as.factor(country_name),
period=readr::parse_number(year)) %>%
select(location,period,gdp)
# Merging dataframes
total <- merge(total,gdp,by=c("location","period"))
#skim(total)
#source: http://ghdx.healthdata.org/record/ihme-data/gbd-2015-smoking-prevalence-1980-2015
#Reading fourth file
smoking_percentage <- read_csv(here::here('Data',"IHME_GBD_2015_SMOKING_PREVALENCE_1980_2015_Y2017M04D05.CSV")) %>%
clean_names()
# Checking for variable types
#skim(smoking_percentage)
# Changing variable names and variable types
smoking_percentage <- smoking_percentage %>%
filter(age_group_name=="Age-standardized",
metric=="Percent",
sex=="Both") %>%
mutate(location=as.factor(location_name),
period=year_id,
smoking_percentage=mean) %>%
select(location,period,smoking_percentage)
#skim(smoking_percentage)
#Merging data frames
total <- merge(total,smoking_percentage,by=c("location","period"))
#source:https://ourworldindata.org/grapher/annual-healthcare-expenditure-per-capita?tab=chart&time=1995..2014&region=World
#Reading fifth file
healthcare_expenditure <- read_csv(here::here('Data',"annual-healthcare-expenditure-per-capita.CSV")) %>%
clean_names()
# Checking for variable types
glimpse(healthcare_expenditure)
1.1.0.9 Adding data for percentage of population being overweight
Rows: 8,316
Columns: 4
$ entity <chr> "Afghanistan", "A…
$ code <chr> "AFG", "AFG", "AF…
$ year <dbl> 1975, 1976, 1977,…
$ prevalence_of_overweight_adults_both_sexes_who_2019 <dbl> 5.3, 5.5, 5.7, 5.…
1.1.0.10 Adding data for fruit consumption per capita
Rows: 11,028
Columns: 4
$ entity <chr> "Afg…
$ code <chr> "AFG…
$ year <dbl> 1961…
$ fruits_excluding_wine_food_supply_quantity_kg_capita_yr_fao_2020 <dbl> 41.1…
# Changing variable names and variable types
healthcare_expenditure <- healthcare_expenditure %>%
mutate(location=as.factor(entity),
period=year,
healthcare_expenditure=health_expenditure_per_capita_ppp_constant_2011_international) %>%
select(location,period,healthcare_expenditure)
#glimpse(healthcare_expenditure)
#Merging data frames
total <- merge(total,healthcare_expenditure,by=c("location","period"))
#source: https://ourworldindata.org/obesity
#Reading sixth file
percentage_overweight <- read_csv(here::here('Data',"share-of-adults-who-are-overweight.csv")) %>%
clean_names()
# Checking for variable types
glimpse(percentage_overweight)
# Changing variable names and variable types
percentage_overweight <- percentage_overweight %>%
mutate(location=as.factor(entity),
period=year,
percentage_overweight=prevalence_of_overweight_adults_both_sexes_who_2019) %>%
select(location,period,percentage_overweight)
#glimpse(percentage_overweight)
#Merging data frames
total <- merge(total,percentage_overweight,by=c("location","period"))
#source: https://ourworldindata.org/diet-compositions
#Reading seventh file
fruit_consumption <- read_csv(here::here('Data',"fruit-consumption-per-capita.csv")) %>%
clean_names()
# Checking for variable types
glimpse(fruit_consumption)
1.1.0.11 Adding data for vegetable consumption per capita
Rows: 11,028
Columns: 4
$ entity <chr> "Afghanistan", …
$ code <chr> "AFG", "AFG", "…
$ year <dbl> 1961, 1962, 196…
$ vegetables_food_supply_quantity_kg_capita_yr_fao_2020 <dbl> 36.75, 37.47, 3…
1.1.0.12 Adding data for animal based foods consumption per capita
# Changing variable names and variable types
fruit_consumption <- fruit_consumption %>%
mutate(location=as.factor(entity),
period=year,
fruit_consumption=fruits_excluding_wine_food_supply_quantity_kg_capita_yr_fao_2020) %>%
select(location,period,fruit_consumption)
#glimpse(fruit_consumption)
#Merging data frames
total <- merge(total,fruit_consumption,by=c("location","period"))
#source: https://ourworldindata.org/diet-compositions
#Reading eighth file
vegetable_consumption <- read_csv(here::here('Data',"vegetable-consumption-per-capita.csv")) %>%
clean_names()
#Checking for variable types
glimpse(vegetable_consumption)
## Changing variable names and variable types
vegetable_consumption <- vegetable_consumption %>%
mutate(location=as.factor(entity),
period=year,
vegetable_consumption=vegetables_food_supply_quantity_kg_capita_yr_fao_2020) %>%
select(location,period,vegetable_consumption)
#glimpse(vegetable_consumption)
#Merging dataframes
total <- merge(total,vegetable_consumption,by=c("location","period"))
#skim(total)
#source: https://ourworldindata.org/diet-compositions
#Reading ninth file
animal_protein_consumption <-read_csv(here::here('Data',"share-of-calories-from-animal-protein-vs-gdp-per-capita.csv")) %>%
clean_names()
#Checking for variable types
glimpse(animal_protein_consumption)
Rows: 24,472
Columns: 7
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ total_population_gapminder <dbl> …
$ continent <chr> …
$ share_of_calories_from_animal_protein_fao_2017 <dbl> …
$ real_gdp_per_capita_in_2011us_2011_benchmark_maddison_project_database_2018 <dbl> …
1.1.0.13 Adding data for mean years of schooling
1.1.0.14 Adding data for physicians per 1000 people
#Changing variable names and type
animal_protein_consumption <- animal_protein_consumption %>%
mutate(location=as.factor(entity),
period=year,
animal_protein_consumption=share_of_calories_from_animal_protein_fao_2017) %>%
select(location,period,animal_protein_consumption)
#glimpse(animal_protein_consumption)
#Merging dataframes
total <- merge(total,animal_protein_consumption,by=c("location","period"))
#glimpse(total)
#source: https://ourworldindata.org/global-education
#Reading file
education_years <- read_csv(here::here('Data',"mean-years-of-schooling-1.csv")) %>%
clean_names()
#Checking for variable types
#glimpse(education_years)
#Changing variable names and type
education_years <- education_years %>%
mutate(location=as.factor(entity),
period=year,
education_years=average_total_years_of_schooling_for_adult_population_lee_lee_2016_barro_lee_2018_and_undp_2018) %>%
select(location,period,education_years)
#glimpse(education_years)
#Merging dataframes
total <- merge(total,education_years,by=c("location","period"))
1.1.0.15 Adding data for nurses per 1000 people
Rows: 1,542
Columns: 4
$ entity <chr> "Afghanistan", "Afghanistan", "A…
$ code <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ year <dbl> 2005, 2006, 2007, 2008, 2009, 20…
$ nurses_and_midwives_per_1_000_people <dbl> 0.612000, 0.462000, 0.519000, 0.…
The nurses dataset had too few observations. Thus, it was not included in our final dataset.
1.1.0.16 Adding data for out-of-pocket expenditure
Rows: 3,002
Columns: 4
$ entity <chr> …
$ code <chr> …
$ year <dbl> …
$ out_of_pocket_expenditure_per_capita_on_healthcare_ppp_usd_who_global_health_expenditure <dbl> …
#source:https://ourworldindata.org/grapher/physicians-per-1000-people
#Reading file
physicians <- read_csv(here::here('Data',"physicians-per-1000-people.csv")) %>%
clean_names()
#Checking for variable types
#glimpse(physicians)
#Changing variable names and type
physicians <- physicians %>%
mutate(location=as.factor(entity),
period=year,
physicians_1000=physicians_per_1_000_people) %>%
select(location,period,physicians_1000)
#glimpse(physicians)
#Merging dataframes
total <- merge(total,physicians,by=c("location","period"))
#source:https://ourworldindata.org/grapher/nurses-and-midwives-per-1000-people?
#Reading file
nurses <- read_csv(here::here('Data',"nurses-and-midwives-per-1000-people.csv")) %>%
clean_names()
#Checking for variable types
glimpse(nurses)
#source:https://ourworldindata.org/grapher/out-of-pocket-expenditure-per-capita-on-healthcare
#Reading file
pocket_exp <- read_csv(here::here('Data',"out-of-pocket-expenditure-per-capita-on-healthcare.csv")) %>%
clean_names()
#Checking for variable types
glimpse(pocket_exp)
1.1.0.17 Adding data for health protection coverage
Rows: 162
Columns: 4
$ entity <chr> "Albania", "…
$ code <chr> "ALB", "DZA"…
$ year <dbl> 2008, 2005, …
$ share_of_population_covered_by_health_insurance_ilo_2014 <dbl> 23.6, 85.2, …
The health coverage dataset had too few observations. Thus, it was not included in our final dataset.
1.1.0.18 Adding data for literacy rate
Rows: 215
Columns: 4
$ entity <chr> "Afghanistan", "Albania", "Algeria", …
$ code <chr> "AFG", "ALB", "DZA", "ASM", "AND", "A…
$ year <dbl> 2000, 2011, 2006, 1980, 2011, 2011, 1…
$ literacy_rate_cia_factbook_2016 <dbl> 28.1, 96.8, 72.6, 97.0, 100.0, 70.4, …
The literacy dataset had too few observations. Thus, it was not included in our final dataset.
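One way to judge whether a variable has enough observations to keep is its complete rate, the share of non-missing values per column. The sketch below uses a made-up data frame, and the 60% cut-off is a hypothetical threshold, not one stated in the notebook.

```r
# Hypothetical toy data with missing values (not the project's data).
toy <- data.frame(
  location = c("A", "B", "C", "D"),
  literacy = c(28.1, NA, NA, 97.0),
  gdp      = c(500, 4000, NA, 30000)
)

# Complete rate: share of non-missing values in each column.
complete_rate <- colMeans(!is.na(toy))
print(complete_rate)

# Columns below the (assumed) 60% cut-off are candidates for exclusion.
sparse <- names(complete_rate)[complete_rate < 0.6]
print(sparse)
```

Checking this before merging matters because the notebook later drops incomplete rows with na.omit(), so one sparse column can remove most of the data.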
1.1.0.19 Adding data for grouping locations into continents
Rows: 194
Columns: 2
$ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa",…
$ country <chr> "Algeria", "Angola", "Benin", "Botswana", "Burkina", "Burun…
#Changing variable names and type
pocket_exp <- pocket_exp %>%
mutate(location=as.factor(entity),
period=year,
pocket_per_cap=out_of_pocket_expenditure_per_capita_on_healthcare_ppp_usd_who_global_health_expenditure) %>%
select(location,period,pocket_per_cap)
#Merging dataframes
total <- merge(total,pocket_exp,by=c("location","period"))
#Reading file
health_protect <- read_csv(here::here('Data',"health-protection-coverage.csv")) %>%
clean_names()
#Checking for variable types
glimpse(health_protect)
#Reading file
literacy <- read_csv(here::here('Data',"literacy-rate-by-country.csv")) %>%
clean_names()
#Checking for variable types
glimpse(literacy)
#source: https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv
#Reading file
continents <- read_csv(here::here('Data',"Continents.csv")) %>%
clean_names()
#Checking for variable types
glimpse(continents)
Rows: 194
Columns: 2
$ location <fct> Algeria, Angola, Benin, Botswana, Burkina, Burundi, Cameroo…
$ continent <fct> Africa, Africa, Africa, Africa, Africa, Africa, Africa, Afr…
1.1.0.20 Dealing with NAs
After including all potentially relevant and significant variables in our dataset, an initial exploration of the data was made.
1.2 Exploratory Data Analysis
1.2.0.1 DALY Rates per Continent
#Changing variable names and type
continents <- continents %>%
mutate(location=as.factor(country),
continent=as.factor(continent))%>%
select(location, continent)
glimpse(continents)
#Merging dataframes
total <- merge(total,continents,by=c("location"))
#Adding variables of per capita healthcare expenditure - per capita gdp
total <- total%>%
mutate(healthcare_gdp_rate = healthcare_expenditure/gdp)
#skim(total)
total <- total %>%
na.omit()
#skim(total)
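Because na.omit() removes every row that has at least one missing value, it can help to count the missing values per column and the rows that survive before relying on the reduced dataset. The following is a minimal sketch on an invented data frame; `toy` and its values are hypothetical and only stand in for `total`.

```r
# Hypothetical toy data standing in for `total`; all values are invented
toy <- data.frame(
  location = c("A", "B", "C", "D"),
  gdp = c(1000, NA, 3000, 4000),
  physicians_1000 = c(0.5, 1.2, NA, 2.1)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows without any missing value
toy_complete <- na.omit(toy)
rows_kept <- nrow(toy_complete)
cat("Rows before:", nrow(toy), "rows after:", rows_kept, "\n")
```

A column with many missing values can remove most of the rows on its own, so inspecting these counts first shows which variable is responsible for the loss.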
#Selecting data only from 1980 - onward (to gain better insights on the recent situation)
total_short <-total %>%
filter(period>=1980)
#Re-coding DALY variables as averages per continent, per year
total_cont<-total_short%>%
group_by(period,continent)%>%
# Note: daly_ivsa is divided by 10,000, whereas the other three columns are divided by 100,000
summarise(daly_adjusted = mean(daly_adjusted/100000), daly_cnmnd = mean(daly_cnmnd/100000),
          daly_ncds = mean(daly_ncds/100000), daly_ivsa = mean(daly_ivsa/10000))
#Plotting for average DALY rates per capita accumulated from 1980 to 2017
ggplot(total_cont, aes(x = continent, y = daly_adjusted, fill = continent)) +
geom_bar(stat = "identity") +
labs(x= "Continent", y = "Overall DALYs", title = "Accumulated Average DALYs per Capita, per Continent 1980 - 2017")
ggplot(total_cont, aes(x = continent, y = daly_cnmnd, fill = continent)) +
geom_bar(stat = "identity")+
labs(x= "Continent", y = "Communicable Diseases DALYs", title = "Accumulated Average DALYs per Capita from Communicable Diseases, per Continent 1980 - 2017")
ggplot(total_cont, aes(x = continent, y = daly_ncds, fill = continent)) +
geom_bar(stat = "identity")+
labs(x= "Continent", y = "Non-Communicable Diseases DALYs", title = "Accumulated Average DALYs per Capita from Non-Communicable Diseases, per Continent 1980 - 2017")
ggplot(total_cont, aes(x = continent, y = daly_ivsa, fill = continent)) +
geom_bar(stat = "identity")+
labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Injuries, per Continent 1980 - 2017")
Overall, we find that Africa has the highest accumulated average DALY rate per capita of all continents (c. 90), followed
by Asia (c. 50) and Oceania (c. 40). The sharp contrast between Africa and the other continents is mainly due to its high
accumulated average for communicable diseases. In this category, Africa's rate is more than triple that of the second-highest
continent (c. 55 for Africa compared to c. 17 for Asia).
When it comes to non-communicable diseases and injuries, rates are fairly even. For non-communicable diseases, DALY
rates range from c. 27 to c. 33 (North America being the lowest and Africa the highest). Injury rates are much lower,
ranging from c. 4 to c. 6 (Europe being the lowest and Africa the highest).
Consequently, communicable diseases are found to place the highest burden on the population, with Africa carrying (or
having carried) the largest share. We took a closer look at these rates to better understand their evolution through
time.
1.2.1 Communicable Diseases
graph1 <- total_cont %>%
ggplot(aes(x=period, y=daly_cnmnd, fill=continent, text=continent)) +
geom_area(alpha = 1) +
theme(legend.position="none") +
labs(x= "Year", y = "DALY for communicable disease", title = "Time Series Average DALYs per Capita from Communicable Diseases per Continent")
ggplotly(graph1)
As seen from the graph, Africa's communicable DALY rate seems to have been in decline since 2008. However, this
continent has consistently ranked above the others, which leaves room to further consider the causes
and potential solutions.
From the Our World in Data report, neonatal disorders are the top communicable disease category in terms of
total share of burden (7.45% of all causes). It is also known that there is a strong negative correlation between GDP and
DALYs from communicable diseases. Similarly, a negative correlation is found between health expenditure per capita
and DALYs from communicable diseases.
What about healthcare expenditure as a percentage of GDP?
ggplot(total_short, aes(x = healthcare_gdp_rate, y = daly_cnmnd, color = continent))+
geom_point()+
labs(x= "Healthcare Expenditure as percentage of GDP", y = "DALY from Communicable Diseases", title = "Rates due to Proportion of GDP spent on Healthcare")
# No clear correlation yet, but interesting
total_short%>%
select(daly_cnmnd, healthcare_gdp_rate, gdp, pocket_per_cap)%>%
ggpairs()
> A higher GDP per country seems to have a significant negative correlation with DALYs from communicable diseases. However, the proportion of GDP used for
healthcare seems to have a significant positive correlation with DALYs from communicable diseases, and GDP seems to have a significant negative correlation with the
proportion of GDP spent on healthcare. This could indicate that poorer countries are more likely to have to combat communicable diseases and,
consequently, spend a greater proportion of their GDP on healthcare than richer countries. Out-of-pocket expenditure is also highly negatively
correlated with DALYs from communicable diseases, although highly positively correlated with GDP. This leads to the interpretation that poor countries, in which the
population is individually responsible for investing in its medical care, are most likely to have higher communicable disease DALY rates.
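The coefficients shown in a ggpairs() plot are, by default, pairwise Pearson correlations for continuous variables. As a rough illustration of how such a pattern can be checked numerically, the sketch below uses base R on invented values; the data frame `toy` is hypothetical and only borrows the column names used in this report.

```r
# Hypothetical toy values, invented so that richer countries have lower DALYs
# and spend a smaller share of GDP on healthcare
toy <- data.frame(
  gdp = c(500, 1000, 2000, 5000, 10000, 20000),
  daly_cnmnd = c(60000, 45000, 30000, 12000, 5000, 2000),
  healthcare_gdp_rate = c(0.09, 0.08, 0.07, 0.05, 0.04, 0.03)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(toy, method = "pearson")
print(round(cor_matrix, 2))

# Significance test for a single pair of variables
test_gdp_daly <- cor.test(toy$gdp, toy$daly_cnmnd)
print(test_gdp_daly$p.value)
```

A correlation of this kind describes association only; it does not show that spending a larger share of GDP on healthcare raises the communicable disease burden.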
1.2.2 Injuries
With DALY rates for injuries and additional causes being similar across all continents, we decided to first take a
closer look at which types of causes were most prominent overall.
#This plot shows injury related DALY in a stacked bar chart.
start <- total%>%
group_by(continent)%>%
summarise(daly_conflict_terrorism = mean(daly_conflict_terrorism/total_population),
          daly_self_harm = mean(daly_self_harm/total_population),
          daly_violence = mean(daly_violence/total_population),
          daly_transport_injuries = mean(daly_transport_injuries/total_population),
          daly_nature_forces = mean(daly_nature_forces/total_population),
          daly_unintentional_injuries = mean(daly_unintentional_injuries/total_population))
pivot <- pivot_longer(start, cols=c(daly_conflict_terrorism, daly_self_harm, daly_violence, daly_transport_injuries,
                                    daly_unintentional_injuries, daly_nature_forces),
                      names_to = "diseases", values_to = "value")
#select columns from dataset
plots <- pivot %>%
select(continent,diseases,value)
ggplot(plots, aes(fill=diseases, y=value, x=continent)) +
geom_bar(position="stack", stat="identity") +
labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Injuries, per Continent 1980 - 2017")
#Plot on Terrorism and Violence
terrorism_violence <- start %>%
select(daly_conflict_terrorism, daly_violence, continent)
terrorism_violence <- pivot_longer(terrorism_violence, c(daly_conflict_terrorism, daly_violence),
                                   names_to = "diseases", values_to = "value")
#select columns from dataset
terrorism_violence <- terrorism_violence%>%
select(diseases,value,continent)
#stacked bar chart
ggplot(terrorism_violence, aes(fill=diseases, y=value, x=continent)) +
geom_bar(position="stack", stat="identity") +
labs(x= "Continent", y = "Injuries DALYs", title = "Accumulated Average DALYs per Capita from Terrorism and Violence 1980 - 2017")
start1 <- total%>%
group_by(continent)%>%
summarise(daly_cvs = mean(daly_cvs/total_population),
          daly_nutritional_deficiencies = mean(daly_nutritional_deficiencies/total_population),
          daly_maternal_disorders = mean(daly_maternal_disorders/total_population),
          daly_musculoskeletal = mean(daly_musculoskeletal/total_population),
          daly_other_non_communicable = mean(daly_other_non_communicable/total_population),
          daly_neurological = mean(daly_neurological/total_population),
          daly_mental_and_substance = mean(daly_mental_and_substance/total_population),
          daly_diabetes_urogenital_blood_endocrine = mean(daly_diabetes_urogenital_blood_endocrine/total_population),
          daly_neoplasms = mean(daly_neoplasms/total_population),
          daly_chronic_liver = mean(daly_chronic_liver/total_population))
pivot1 <- pivot_longer(start1, c(daly_cvs, daly_nutritional_deficiencies, daly_maternal_disorders, daly_musculoskeletal,
                                 daly_other_non_communicable, daly_neurological, daly_mental_and_substance,
                                 daly_diabetes_urogenital_blood_endocrine, daly_neoplasms, daly_chronic_liver),
                       names_to = "diseases", values_to = "value")
#select columns from data set
total_short_ncds <- pivot1%>%
select(continent,diseases,value)
#stacked bar chart
# This stacked bar chart again shows DALYs for non-communicable diseases, adjusted to show data per 100,000 population.
# Additionally, the data has been colored to show the different categories of non-communicable diseases.
# Asia has the highest DALYs for non-communicable diseases, closely followed by Europe. There are reasons that may
# explain why DALYs remain high in both regions. For Asia, a lack of affordability, a lack of doctors, and healthcare
# that is not of the highest standard may all contribute. Due to Europe's aging population, non-communicable
# diseases are more likely to be present among its population. As seen in the earlier graphs, as nations become
# modern and developed, their populations transition from suffering from communicable diseases towards
# non-communicable diseases, which come with age.
ggplot(total_short_ncds, aes(fill=diseases, y=value, x=continent)) +
geom_bar(position="stack", stat="identity") +
labs(x= "Continent", y = "Non-Comm DALYs", title = "Accumulated Average DALYs per Capita from Non-Comm, per Continent 1980 - 2017")
# Looking into CVS in more detail.
ggplot(total_short, aes(x= continent, y = daly_cvs))+
geom_col()+
labs(x= "Continent", y = "Daly due to CVS related conditions", title = "DALY per capita due to CVS conditions per continent")
# Looking into neoplasms in more detail.
ggplot(total_short, aes(x= continent, y = daly_neoplasms))+
geom_col()+
labs(x= "Continent", y = "Daly due to neoplasm", title = "DALY per capita due to neoplasms per continent")
# Looking into diabetes, urogenital, blood, endocrine in more detail.
ggplot(total_short, aes(x= continent, y = daly_diabetes_urogenital_blood_endocrine))+
geom_col()+
labs(x= "Continent", y = "Daly due to diabetes, urogenital, blood and endocrine related conditions.", title = "DALY per capita due to diabetes, urogenital, blood and endocrine related conditions per continent")
ggplot(total_short, aes(x= continent, y = daly_mental_and_substance))+
geom_col()+
labs(x= "Continent", y = "Daly due to mental and substance related conditions.", title = "DALY per capita due to mental and substance related conditions per continent")
ggplot(total_short, aes(x = healthcare_gdp_rate, y = daly_ncds/100000, color = continent))+
geom_point()+
labs(x= "Healthcare Expenditure as percentage of GDP", y = "DALY from Non- Communicable Diseases", title = "Rates due to Proportion of GDP spent on Healthcare")
1.3 Regression analysis
Although the final DALY rates are highly complex, with many different societal and economic variables affecting them, we
decided to look into certain variables that had enough data to be used for our analysis. These variables, which affect both
DALY rates by cause and general DALY rates, can be divided into several categories:
Diet habit variables (fruit consumption per capita per year, percentage of animal protein consumption out of total daily
calories, vegetable consumption, percentage of population being overweight), healthcare variables (annual healthcare
expenditure, out-of-pocket expenditure on healthcare, healthcare per GDP, and number of physicians per 1,000 people),
living habits (smoking percentages), and other demographics (education years).
In addition to these elements, we considered the effect of each continent separately by transforming them into dummy
variables.
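As a hedged illustration of what dummy coding does, the sketch below builds indicator columns from a factor with base R's model.matrix(). The toy vector is invented, and treating Oceania as the reference level is an assumption based on it being the continent without its own dummy in this analysis.

```r
# Hypothetical toy factor; the real data has one row per country and year
continent <- factor(c("Africa", "Asia", "Europe", "Oceania", "Asia", "Africa"))

# Assumption: Oceania is the baseline, i.e. the level without its own dummy
continent <- relevel(continent, ref = "Oceania")

# One 0/1 column per non-reference level, plus the intercept column
dummies <- model.matrix(~ continent)
print(dummies)
```

With this coding, each continent coefficient in a regression is read as the difference from the reference continent, holding the other predictors fixed. Passing the factor directly to lm() would create the same columns automatically.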
1.3.0.1 Models 0 and 1
#Transforming continent factors into dummy variables
total <- total %>%
mutate(Asia = case_when(continent == "Asia" ~ 1, TRUE ~ 0),
       Europe = case_when(continent == "Europe" ~ 1, TRUE ~ 0),
       NorthA = case_when(continent == "North America" ~ 1, TRUE ~ 0),
       Africa = case_when(continent == "Africa" ~ 1, TRUE ~ 0),
       SouthA = case_when(continent == "South America" ~ 1, TRUE ~ 0))
Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
daly_ivsa + daly_ncds + daly_cnmnd, data = total, subset = gdp)
Residuals:
Min 1Q Median 3Q Max
-6.602e-11 -4.258e-12 -5.730e-13 2.894e-12 1.220e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.867e-12 1.064e-11 -3.640e-01 0.716590
smoking_percentage -1.016e-10 1.900e-11 -5.346e+00 2.49e-07 ***
percentage_overweight -6.735e-13 1.006e-13 -6.694e+00 2.26e-10 ***
fruit_consumption -7.039e-14 2.012e-14 -3.499e+00 0.000579 ***
vegetable_consumption -1.385e-14 2.245e-14 -6.170e-01 0.537948
animal_protein_consumption 5.153e-12 7.264e-13 7.094e+00 2.35e-11 ***
education_years 1.769e-12 6.270e-13 2.821e+00 0.005277 **
physicians_1000 3.200e-12 1.696e-12 1.887e+00 0.060675 .
pocket_per_cap -2.376e-14 8.464e-15 -2.807e+00 0.005506 **
healthcare_gdp_rate 3.037e-11 1.838e-11 1.652e+00 0.100074
daly_ivsa 1.000e+00 1.045e-15 9.570e+14 < 2e-16 ***
daly_ncds 1.000e+00 3.813e-16 2.623e+15 < 2e-16 ***
daly_cnmnd 1.000e+00 1.157e-16 8.645e+15 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.446e-11 on 195 degrees of freedom
(817 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.025e+31 on 12 and 195 DF, p-value: < 2.2e-16
# lm0 was created to show that daly_ivsa, daly_ncds and daly_cnmnd make up daly_adjusted. As a result, these three
# variables are not included in the later linear models.
# Note: the unnamed `gdp` argument is matched positionally to lm()'s `subset` parameter rather than added as a
# predictor, which is why the Call above reports `subset = gdp`.
lm0= lm(daly_adjusted ~ smoking_percentage+ percentage_overweight+ fruit_consumption+ vegetable_consumption+
        animal_protein_consumption+ education_years+ physicians_1000+ pocket_per_cap+ healthcare_gdp_rate + daly_ivsa +
        daly_ncds + daly_cnmnd, gdp, data = total)
summary(lm0)
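The R-squared of 1 and the coefficients of exactly 1 on the three DALY components follow directly from the response being their sum. A minimal sketch on simulated data shows the same effect; every variable name and value below is hypothetical.

```r
set.seed(1)
n <- 50

# Hypothetical components whose sum defines the response exactly
part_a <- runif(n, 1000, 5000)
part_b <- runif(n, 1000, 5000)
part_c <- runif(n, 100, 500)
other <- rnorm(n)  # an unrelated predictor
total_daly <- part_a + part_b + part_c

fit_sum <- lm(total_daly ~ part_a + part_b + part_c + other)

# summary() warns about an essentially perfect fit, so the warning is silenced here
r2 <- suppressWarnings(summary(fit_sum))$r.squared
coefs <- coef(fit_sum)
print(round(coefs, 6))
print(r2)
```

The components absorb all of the variation, leaving nothing for the other predictor to explain, which is why such components have to be left out of any model meant to explain the total.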
lm1= lm(daly_adjusted ~ smoking_percentage+ percentage_overweight+ fruit_consumption+ vegetable_consumption+
        animal_protein_consumption+ education_years+ physicians_1000+ pocket_per_cap+ healthcare_gdp_rate + gdp,
        data = total)
summary(lm1)
Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
fruit_consumption + vegetable_consumption + animal_protein_consumption +
education_years + physicians_1000 + pocket_per_cap + healthcare_gdp_rate +
gdp, data = total)
Residuals:
Min 1Q Median 3Q Max
-25122 -6233 -812 4866 59542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.921e+04 1.894e+03 41.814 < 2e-16 ***
smoking_percentage -2.391e+04 6.139e+03 -3.894 0.000105 ***
percentage_overweight -2.590e+02 3.531e+01 -7.335 4.53e-13 ***
fruit_consumption -5.051e+01 8.655e+00 -5.836 7.19e-09 ***
vegetable_consumption -3.828e+01 6.980e+00 -5.484 5.26e-08 ***
animal_protein_consumption -1.799e+03 2.719e+02 -6.615 6.00e-11 ***
education_years -1.270e+03 2.027e+02 -6.268 5.41e-10 ***
physicians_1000 7.964e+02 4.986e+02 1.597 0.110538
pocket_per_cap -1.011e+01 2.508e+00 -4.031 5.96e-05 ***
healthcare_gdp_rate 6.335e+03 6.803e+03 0.931 0.351968
gdp 6.325e-02 3.290e-02 1.923 0.054808 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11340 on 1014 degrees of freedom
Multiple R-squared: 0.6371, Adjusted R-squared: 0.6335
F-statistic: 178 on 10 and 1014 DF, p-value: < 2.2e-16
Already from model 1 we reach an adjusted R-squared of 0.6335, meaning these factors can explain approximately
63 percent of the variation in general DALYs. For the models below, the variable with the highest p-value was dropped
sequentially.
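The manual procedure of dropping the least significant variable one at a time can be sketched as a small loop. The helper `backward_by_p`, its `alpha` threshold of 0.05, and the simulated data are all hypothetical and are not the code used to produce the models in this report.

```r
# Hypothetical helper: refit repeatedly, dropping the predictor with the
# largest p-value until every remaining predictor is at or below `alpha`
backward_by_p <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    p_values <- coefs[-1, "Pr(>|t|)"]          # skip the intercept row
    names(p_values) <- rownames(coefs)[-1]
    worst <- which.max(p_values)
    if (p_values[worst] <= alpha || length(predictors) == 1) {
      return(fit)
    }
    predictors <- setdiff(predictors, names(p_values)[worst])
  }
}

# Simulated data: y depends on x1 and x2 but not on `noise`
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

reduced <- backward_by_p(toy, "y", c("x1", "x2", "noise"))
print(formula(reduced))
```

This sketch assumes numeric predictors, so that coefficient names match predictor names; factor predictors would need extra handling.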
lm2 = lm( daly_adjusted~smoking_percentage+ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+
education_years+ physicians_1000+ pocket_per_cap+ fruit_consumption + gdp, data = total)
summary(lm2)
Call:
lm(formula = daly_adjusted ~ smoking_percentage + percentage_overweight +
vegetable_consumption + animal_protein_consumption + education_years +
pocket_per_cap + fruit_consumption, data = total)
Residuals:
Min 1Q Median 3Q Max
-25602 -6210 -408 4778 59363
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78595.709 1259.974 62.379 < 2e-16 ***
smoking_percentage -21927.213 5986.815 -3.663 0.000263 ***
percentage_overweight -256.075 34.687 -7.382 3.23e-13 ***
vegetable_consumption -37.819 6.737 -5.613 2.56e-08 ***
animal_protein_consumption -1623.099 249.729 -6.499 1.26e-10 ***
education_years -1107.609 190.699 -5.808 8.44e-09 ***
pocket_per_cap -6.709 1.996 -3.361 0.000806 ***
fruit_consumption -49.239 8.384 -5.873 5.80e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11360 on 1017 degrees of freedom
Multiple R-squared: 0.6346, Adjusted R-squared: 0.632
F-statistic: 252.3 on 7 and 1017 DF, p-value: < 2.2e-16
All variables are now significant, leading to a model with an adjusted R-squared of 0.632.
1.3.1 Stepwise regression & VIF exam
We can also use stepwise regression to find the optimal model. The stepwise method is more systematic than dropping
variables manually, since a dropped variable can be added back in a later step if doing so improves the model
(lowers the model's AIC), and the model is re-evaluated after each variable is added or dropped.
lm4=lm( daly_adjusted~smoking_percentage+ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+
        education_years+ pocket_per_cap+ fruit_consumption, data = total)
summary(lm4)
fit1_step=step(lm1,direction="both")
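To make the AIC-based selection concrete, the sketch below runs step() on R's built-in mtcars dataset, which is used here only as a stand-in for the project data. The chosen predictors are arbitrary.

```r
# Start from a full model on the built-in mtcars data (a stand-in dataset)
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# direction = "both" lets step() drop terms and add dropped terms back;
# trace = 0 suppresses the step-by-step printout
stepped <- step(full, direction = "both", trace = 0)

aic_full <- AIC(full)
aic_stepped <- AIC(stepped)
print(formula(stepped))
cat("AIC full:", aic_full, "AIC stepped:", aic_stepped, "\n")
```

step() ranks candidate models by AIC rather than by p-values, so a variable it retains is not guaranteed to be significant at a given level.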
smoking_percentage percentage_overweight
1.668205 2.913117
fruit_consumption vegetable_consumption
1.425008 1.502601
animal_protein_consumption education_years
3.110474 3.088943
physicians_1000 pocket_per_cap
3.754590 3.210832
gdp
2.999374
From the final result we can see that six variables are significant with a p-value lower than 0.1. Expense and
pocket_per_cap are both significant in this case. However, dropping one of them may make the other insignificant.
This could be because these two have a joint effect on the burden of disease. We can choose between these two models
according to the confidence level we require.
Continents were also considered as part of the model to see their effect.
Call:
lm(formula = daly_adjusted ~ percentage_overweight + vegetable_consumption +
animal_protein_consumption + education_years + pocket_per_cap +
fruit_consumption + Asia + Africa + NorthA + Europe + SouthA +
Asia + Africa + NorthA + Europe + SouthA, data = total)
Residuals:
Min 1Q Median 3Q Max
-29935 -4148 -431 3996 50866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66522.998 2345.279 28.365 < 2e-16 ***
percentage_overweight -202.369 33.403 -6.058 1.94e-09 ***
vegetable_consumption -33.863 6.185 -5.475 5.51e-08 ***
animal_protein_consumption -1030.842 209.884 -4.911 1.05e-06 ***
education_years -534.613 165.120 -3.238 0.00124 **
pocket_per_cap -8.669 1.678 -5.168 2.86e-07 ***
fruit_consumption -40.857 6.824 -5.987 2.96e-09 ***
Asia -7140.437 1807.499 -3.950 8.34e-05 ***
Africa 13792.577 1918.746 7.188 1.27e-12 ***
NorthA -9335.463 1794.310 -5.203 2.38e-07 ***
Europe -5196.987 1650.294 -3.149 0.00169 **
SouthA -9146.724 1917.915 -4.769 2.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9298 on 1013 degrees of freedom
Multiple R-squared: 0.7562, Adjusted R-squared: 0.7535
F-statistic: 285.6 on 11 and 1013 DF, p-value: < 2.2e-16
vif(fit1_step)
# The continent dummies appear twice in the formula; R keeps each term once, so the model has 11 predictors
fit = lm(daly_adjusted~ percentage_overweight+ vegetable_consumption+ animal_protein_consumption+ education_years+
         pocket_per_cap+ fruit_consumption+Asia+Africa+NorthA+Europe+SouthA+ Asia+ Africa+ NorthA+ Europe+ SouthA,
         data = total)
print(summary(fit))
print(vif(fit))
percentage_overweight vegetable_consumption
3.908936 1.772149
animal_protein_consumption education_years
2.885232 3.048594
pocket_per_cap fruit_consumption
2.136245 1.318169
Asia Africa
7.257346 6.590683
NorthA Europe
3.209642 7.556342
SouthA
2.440735
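The variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch of this calculation follows, using the built-in mtcars dataset as a stand-in; the helper `vif_manual` is hypothetical and is not the vif() function called in this report.

```r
# Hypothetical helper: VIF of each predictor from its regression on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

predictors <- c("wt", "hp", "disp")
vifs <- vif_manual(mtcars, predictors)
print(round(vifs, 2))

# Cross-check: the VIFs equal the diagonal of the inverse correlation matrix
vifs_check <- diag(solve(cor(mtcars[, predictors])))
print(round(vifs_check, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others. Commonly quoted rules of thumb treat values above 5 or 10 as a sign of problematic collinearity, so the Asia and Europe dummies above (about 7.3 and 7.6) may deserve a closer look.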
1.3.2 Interpretation of the final model
Our final model had 11 variables.
Call:
lm(formula = daly_adjusted ~ percentage_overweight + vegetable_consumption +
animal_protein_consumption + education_years + pocket_per_cap +
fruit_consumption + Asia + Africa + NorthA + Europe + SouthA +
Asia + Africa + NorthA + Europe + SouthA, data = total)
Residuals:
Min 1Q Median 3Q Max
-29935 -4148 -431 3996 50866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66522.998 2345.279 28.365 < 2e-16 ***
percentage_overweight -202.369 33.403 -6.058 1.94e-09 ***
vegetable_consumption -33.863 6.185 -5.475 5.51e-08 ***
animal_protein_consumption -1030.842 209.884 -4.911 1.05e-06 ***
education_years -534.613 165.120 -3.238 0.00124 **
pocket_per_cap -8.669 1.678 -5.168 2.86e-07 ***
fruit_consumption -40.857 6.824 -5.987 2.96e-09 ***
Asia -7140.437 1807.499 -3.950 8.34e-05 ***
Africa 13792.577 1918.746 7.188 1.27e-12 ***
NorthA -9335.463 1794.310 -5.203 2.38e-07 ***
Europe -5196.987 1650.294 -3.149 0.00169 **
SouthA -9146.724 1917.915 -4.769 2.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9298 on 1013 degrees of freedom
Multiple R-squared: 0.7562, Adjusted R-squared: 0.7535
F-statistic: 285.6 on 11 and 1013 DF, p-value: < 2.2e-16
percentage_overweight vegetable_consumption
3.908936 1.772149
animal_protein_consumption education_years
2.885232 3.048594
pocket_per_cap fruit_consumption
2.136245 1.318169
Asia Africa
7.257346 6.590683
NorthA Europe
3.209642 7.556342
SouthA
2.440735
continent_fit=fit
summary(continent_fit)
vif(continent_fit)
AIC 22060.50 22059.38 22060.42 22061.60 21654.84
*** p < 0.001; ** p < 0.01; * p < 0.05.
actual predicted
actual 1.0000000 0.8572182
predicted 0.8572182 1.0000000
From the 5 models, continent_fit was chosen as the final model because all of its variables are significant and it has the highest
R-squared (0.76). As can be seen from the correlation matrix above, the model's predictions for 2013 have a correlation of
0.8572 with the actual DALY rates, which we report as 85.72 percent accuracy.
best_model <- continent_fit
#Part 2: We wanted to test the predictive power of our model by checking that it could predict, with a certain
# level of confidence, the DALYs for the last full year of data (2013)
train <- total %>%
filter(period<2013)
# Named `test` rather than `predict` so that the data frame does not share a name with the predict() function
test <- total %>%
filter(period == 2013)
# Refit the final model's formula on the training years only
continent_fit2 <- lm(formula(continent_fit), data = train)
final_prediction <- predict(continent_fit2, newdata = test)
ac_pred <- data.frame(cbind(actual = test$daly_adjusted, predicted = final_prediction))
correlation_accuracy <- cor(ac_pred)
correlation_accuracy
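A correlation between actual and predicted values measures how well the two move together, but not how large the prediction errors are. As a hedged complement, the sketch below computes error-based measures on a held-out year alongside the correlation. The data is simulated, and all names and values are hypothetical.

```r
# Simulated panel: 10 observations per year for 2002 to 2013 (invented values)
set.seed(7)
n <- 120
toy <- data.frame(period = rep(2002:2013, each = 10), x = rnorm(n))
toy$y <- 50 + 8 * toy$x + rnorm(n, sd = 4)

# Hold out the last year, fit on the earlier years
train_toy <- subset(toy, period < 2013)
test_toy <- subset(toy, period == 2013)
fit_toy <- lm(y ~ x, data = train_toy)
pred <- predict(fit_toy, newdata = test_toy)

# Association between actual and predicted values
correlation <- cor(test_toy$y, pred)

# Error sizes, in the units of the response
rmse <- sqrt(mean((test_toy$y - pred)^2))
mae <- mean(abs(test_toy$y - pred))

cat("Correlation:", correlation, "RMSE:", rmse, "MAE:", mae, "\n")
```

Because correlation is unaffected by a constant shift or rescaling of the predictions, a model that is systematically too high or too low can still score well on it. Reporting RMSE or MAE next to the correlation makes that kind of bias visible.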