Statistical Analysis on the Factors
Influencing Life Expectancy
Anh Do, Xuemeng Han, Hang Ngo, Jennifer Wong
Agenda
● Background & Problem
● Dataset Description
● Data Pre-Processing
● Model & Variable Selection
● Results
● Conclusion & Lessons Learned
Background & Problem
Analytics question:
“What factors are significant and what model is the best at predicting life
expectancy at birth (LEB)?”
● Analytics goal: predictive accuracy
● Rationale:
○ Accurate prediction helps countries understand whether their investment in social
and economic development is effective
○ Understanding important determinants can help countries allocate resources
appropriately
Dataset Description
● Response variable: LEB (in years)
● Predictors: 22 total variables
○ Economic indicators (GDP, Total Health Expenditure as % of GDP per capita, etc.)
○ Health indicators (HIV/AIDS, Vaccine coverage, Obesity, etc.)
○ One categorical variable (Status: Developed or Developing country); the rest are numeric
variables
● Data cleaning process:
○ Replace original Life Expectancy data with LEB due to inconsistency with official data
sources (WHO)
○ Replace some predictors that have missing values with more complete data from reliable
sources (WHO and World Bank)
Data Pre-Processing
● OLS assumptions violated: Errors are heteroskedastic
● OLS assumptions violated: Multicollinearity
Condition Index:
○ Initial: 1839607969.310
○ After Standardizing & Centering: 56.997
[Figure: Variance Inflation Factors, Condition Index]
Condition Index: 9.450
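The multicollinearity diagnostics on this slide can be illustrated with a minimal sketch. It assumes NumPy and uses synthetic stand-in data, not the WHO dataset; the column names and the helper functions `vif`, `condition_index`, and `standardize` are hypothetical, and the condition index here is simply the ratio of the largest to the smallest singular value of the predictor matrix, which may differ from the exact definition the authors' software used.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2), where R^2
    comes from regressing that column on all the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

def condition_index(X):
    """Ratio of the largest to the smallest singular value of X."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return s[0] / s[-1]

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(1e4, 3e3, n)                       # large-scale column
infant = rng.normal(30, 10, n)
under_five = 1.3 * infant + rng.normal(0, 0.5, n)   # nearly collinear with infant
hiv = rng.exponential(1.0, n)

X_full = np.column_stack([gdp, infant, under_five, hiv])
X_reduced = np.column_stack([gdp, under_five, hiv])  # one collinear column dropped

ci_raw = condition_index(X_full)
ci_std = condition_index(standardize(X_full))
ci_reduced = condition_index(standardize(X_reduced))
vif_full = vif(X_full)
vif_reduced = vif(X_reduced)

print("condition index raw / standardized / reduced:", ci_raw, ci_std, ci_reduced)
print("VIF full:", vif_full)
print("VIF reduced:", vif_reduced)
```

In this toy setup, standardizing removes the part of the condition index that comes from mismatched scales, while the VIFs only fall once one of the two nearly collinear columns is dropped.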
Model & Variable Selection
● Parametric models to address heteroskedasticity and dimensionality:
○ Weighted Least Squares
○ Ridge, LASSO
○ Principal Component Regression, Partial Least Squares
● Non-parametric models:
○ Regression Tree
○ Random Forest
● Two model specifications: 17 variables (full) and 14 variables (reduced)
○ There is no business restriction to keep all predictors in the model
○ Stepwise and Best Subset were run to select a reduced model
○ Both methods suggest the same set of 14 variables to be included
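Two of the parametric estimators listed on this slide have simple closed forms, sketched below under stated assumptions: NumPy, synthetic heteroskedastic data, weights taken as known (1/x²), and hypothetical helper names `ols_fit`, `wls_fit`, and `ridge_fit`. It is an illustration of the techniques, not the authors' fitting code.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def wls_fit(X, y, w):
    """Weighted least squares: minimize sum_i w_i * (y_i - x_i . b)^2
    by rescaling each row with sqrt(w_i) and solving an OLS problem."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def ridge_fit(Z, yc, lam):
    """Ridge coefficients (Z'Z + lam*I)^(-1) Z'yc for standardized
    predictors Z and a centered response yc (no intercept column)."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)

# Synthetic data whose error spread grows with x (assumption for illustration).
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(1, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.2 * x)

X = np.column_stack([np.ones(n), x])
beta_ols = ols_fit(X, y)
beta_wls = wls_fit(X, y, 1.0 / x**2)   # weights proportional to 1 / variance

# Ridge on two correlated, standardized predictors.
x2 = x + rng.normal(0, 1.0, n)
P = np.column_stack([x, x2])
Z = (P - P.mean(axis=0)) / P.std(axis=0)
yc = y - y.mean()
beta_ridge_0 = ridge_fit(Z, yc, 0.0)
beta_ridge_100 = ridge_fit(Z, yc, 100.0)

print("OLS:", beta_ols, "WLS:", beta_wls)
print("ridge lam=0:", beta_ridge_0, "ridge lam=100:", beta_ridge_100)
```

With equal weights, WLS reduces to OLS, and with a penalty of zero, ridge reduces to OLS on the standardized predictors; increasing the penalty shrinks the coefficient vector toward zero.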
Results
● Regression tree was chosen as the most accurate model to predict LEB
Final model: Regression tree
● HIV_AIDS is the most important factor in predicting LEB
[Figure: regression tree; first split HIV_AIDS < 0.95; node values shown: 79.47, 47.27]
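How a regression tree chooses a first split such as HIV_AIDS < 0.95 can be sketched in a few lines. The version below is a simplified, single-variable illustration: it scans candidate thresholds and keeps the one that minimizes the summed squared error of the two resulting groups. The toy values are hypothetical and were chosen only so that the split lands near 0.95; they are not the authors' data, and `sse` and `best_split` are hypothetical helper names.

```python
def sse(values):
    """Sum of squared deviations from the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the binary split on x that
    minimizes SSE(left) + SSE(right)."""
    pairs = sorted(zip(x, y))
    best_thr, best_total = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical x values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_total:
            best_thr, best_total = (lo + hi) / 2, total
    return best_thr, best_total

# Hypothetical toy data: an HIV/AIDS rate and life expectancy in years.
hiv = [0.1, 0.1, 0.3, 0.5, 0.9, 1.0, 2.5, 5.0, 8.0, 12.0]
leb = [80, 79, 78, 77, 76, 62, 60, 55, 50, 48]

thr, total = best_split(hiv, leb)
left_group = [t for h, t in zip(hiv, leb) if h < thr]
right_group = [t for h, t in zip(hiv, leb) if h >= thr]
left_mean = sum(left_group) / len(left_group)
right_mean = sum(right_group) / len(right_group)

print("threshold:", thr)
print("mean LEB below / above threshold:", left_mean, right_mean)
```

A full tree repeats this search over every predictor and then recursively within each resulting group; each leaf predicts the mean response of the observations that reach it.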
Lessons Learned
● Kaggle datasets need to be inspected carefully for data quality and validity before being
analyzed
● If some data need to be replaced or dropped, it is important to have a clear rationale for the new
data chosen
○ Dropped Hepatitis B data due to too many missing values
○ Replaced BMI with Obesity, and other predictors with more reliable data sources
Thank You
Q & A
Anh Do, Xuemeng Han, Hang Ngo, Jennifer Wong
APPENDIX
Dataset Description
● Economic indicators:
○ Status
○ GDP
○ Population
○ Total Healthcare Expenditure
○ Percentage Expenditure in Healthcare
○ Income Index
○ Years Of Schooling
● Health and Risk Factors:
○ Adult Mortality
○ Infant Deaths
○ Under Five Deaths
○ Polio
○ Diphtheria
○ Measles
○ HIV/AIDS
○ Thinness (5-9 years old)
○ Thinness (10-19 years old)
○ Obesity
○ Alcohol Consumption
Correlation Matrix - Economic Indicators
Correlation Matrix - Health Indicators
Correlation Matrix - InfantDeaths and UnderFiveDeaths
Full Model - OLS Testing
Full Model - WLS
Full Model - Ridge & LASSO
Full Model - PCR
Full Model - PLS
Full Model - Regression Tree
Full Model - Random Forest
Full Model - Random Forest cont.
Variable Selection
Reduced Model - WLS
Reduced Model - Ridge & LASSO
Reduced Model - PCR
Reduced Model - PLS
Reduced Model - Regression Tree
Final Results

Editor's Notes

  1. Anh: dataset, pre-processing; Jennifer: model and variable selection, results; Hang: conclusion and challenges
  2. LEB is a country-level statistic monitored by countries and global organizations to evaluate the quality of population health and economic development. Life expectancy worldwide has increased because of modernization and better standards of living, and rising LEB marks an important improvement in population health. Given this background and the data we collected, our predictive question is which factors are significant and which model predicts LEB best. Our goal is to compare different methods and obtain the most accurate prediction of life expectancy, because this can help countries judge whether their investment is effective and identify the key determinants that should guide how they allocate resources.
  3. The Life Expectancy dataset contains a total of 22 variables from the WHO’s data repository, tracking health and economic variables for 193 countries over 16 years (2000 - 2015). The response variable is LifeExpectancy, which measures the average life expectancy (in years) of the population in each year. The original dataset has substantial missing data in some predictors, which we replaced with data from other sources (such as the World Bank and the Human Development Reports), as noted in Appendix 1. We also removed the variable HepatitisB because it has too many missing values, which would significantly reduce the number of observations available when we run parametric models. The dataset has one categorical variable (Status: developed or developing country). The rest of the variables are numeric (such as GDP per capita in dollars, health expenditure as % of GDP per capita, % coverage of some vaccines, etc.).
  4. The Breusch-Pagan test statistic of 302.85 was significant, indicating a heteroskedasticity problem. The Condition Index (CI) and Variance Inflation Factors (VIFs) suggested the presence of multicollinearity. Centering and standardizing the data reduced the multicollinearity problem significantly but did not entirely fix it: the CI decreased from about 1.8 billion to 57. The main “culprits” for multicollinearity were InfantDeaths and UnderFiveDeaths (both VIFs were close to 300). InfantDeaths was removed from the dataset since UnderFiveDeaths had a higher correlation with LifeExpectancy (Appendix 6). After the removal, the CI fell to 9.45 and no VIF exceeded 10.
  5. To address heteroskedasticity, a WLS model was run. For the dimensionality issues, Ridge, LASSO, PCR, and PLS were all included as candidate methods. Since our analytics goal is predictive accuracy, tree-based models were also fit and evaluated along with the other models. In terms of the variables, there was no business restriction requiring us to keep all of them in the model. When we ran a stepwise method for variable selection, the optimal number of variables was 14. Both the stepwise and best subset methods gave us the same 14 variables to include in the reduced model.
  6. For cross-validation, we split the data 60-40 into training and test subsamples. The RMSEs of all the models were about the same, around 0.39. The tree methods had the lowest RMSE at 0.389783, and we got exactly the same result for the full and reduced trees. We suspect this is because the reduced model had only three fewer variables than the full model, and those predictors were likely unimportant to the tree-splitting algorithm anyway.
  7. Since our analytics goal is prediction accuracy, the regression tree, which has the lowest RMSE, was chosen. The tree model first partitions the data on HIV_AIDS, the number of deaths per 1,000 live births (0-4 years) due to HIV/AIDS. Countries with a rate below 0.95 generally have higher life expectancies than countries with higher rates, especially those with a rate above 5.05. A random forest model was also generated to analyze variable importance, and it confirmed that HIV_AIDS was the variable that contributed most to the reduction of MSE when predicting LifeExpectancy (Appendix 9).
  8. First challenge: a Kaggle dataset needs to be inspected carefully for data quality and validity before it is analyzed. The author cited WHO data, but we don’t know whether the author collected the data properly. Once we checked, we realized there were significant missing values in some predictors. If we had used the original dataset, a lot of valuable observations would have been excluded from parametric models such as OLS and WLS. There were also major data-entry errors in some variables. Second challenge: if some data needs to be replaced or dropped, it’s important to have a clear rationale for the new data chosen. First example: we dropped the Hepatitis B variable because its values were systematically missing. Some developed countries did not administer Hepatitis B vaccines until recently (2017, 2018) and therefore did not collect data on this vaccine coverage for earlier periods. Second example: we replaced the original BMI data because of data-entry errors. A person with a BMI over 30 is considered obese, yet this data contains a significant number of country-average BMI values of 80 and above. BMI in some countries also fluctuates by a factor of 10 over the years, which is unreasonable. We decided to replace this variable with data from a more reliable source (directly from WHO) on the percentage of the population with a BMI over 30 kg/m². We included this variable because we think that obesity is a relevant factor in predicting LEB.
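
The evaluation workflow described in notes 6 and 7 (a 60-40 training/test split, a regression tree whose first partition is on HIV_AIDS, and comparison by RMSE) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the presenters' code: the data are synthetic, the helper names `rmse`, `split_60_40`, and `best_split` are hypothetical, and the "tree" is reduced to a single split on one predictor chosen by minimizing the sum of squared errors.

```python
import random
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def split_60_40(rows, seed=0):
    """Shuffle a copy of rows and return (training 60%, test 40%)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10
    return shuffled[:cut], shuffled[cut:]


def best_split(rows):
    """Find the single threshold on x that minimizes the total squared error.

    rows is a list of (x, y) pairs. Returns (threshold, left_mean, right_mean),
    where left covers x < threshold and right covers x >= threshold.
    """
    xs = sorted({x for x, _ in rows})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2  # candidate cut halfway between neighbours
        left = [y for x, y in rows if x < threshold]
        right = [y for x, y in rows if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


# Synthetic (made-up) country-year rows: (HIV_AIDS rate, life expectancy in years).
rows = []
for i in range(50):
    rows.append((0.10 + 0.01 * i, 75.0 + i % 5))  # low rate, higher life expectancy
    rows.append((5.00 + 0.01 * i, 55.0 + i % 5))  # high rate, lower life expectancy

train, test = split_60_40(rows)
threshold, left_mean, right_mean = best_split(train)

# Predict each test row with the mean of the training leaf it falls into.
predicted = [left_mean if x < threshold else right_mean for x, _ in test]
test_rmse = rmse([y for _, y in test], predicted)

print(f"split at HIV_AIDS < {threshold:.3f}")
print(f"leaf means: {left_mean:.2f} (low rate), {right_mean:.2f} (high rate)")
print(f"test RMSE: {test_rmse:.3f}")
```

A full regression tree applies the same split search recursively to each leaf and across all predictors. The RMSE printed here is in years on made-up data and is not comparable to the 0.39 reported in the notes.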