How to use Logistic Regression in GIS using ArcGIS and R statistics
Regression Analysis Project
Life Expectancy by Country
Michael Wallace & Brandon Berube
12/12/2013
For our project, my partner, Brandon Berube, and I decided to use regression analysis to see which
variables affect how long a person will live, depending on the country they inhabit. We were able to
use various variable screening methods to fit an effective model that allowed us not only to
make inferences about how long people in a country will live, but also to show how accurate our model is.
There were 20 Randomly Selected Countries.
- We randomly selected the countries by giving each country a number from 1 to 213 and using
Random.org to pick 20 random numbers, which correspond to the countries below.
1. Iraq
2. Mongolia
3. Lesotho
4. Canada
5. Mauritius
6. Oman
7. Samoa
8. Mali
9. Bangladesh
10. Suriname
11. Tonga
12. Qatar
13. Bulgaria
14. Micronesia, Fed. Sts.
15. Spain
16. Trinidad and Tobago
17. Papua New Guinea
18. Tanzania
19. Austria
20. Sao Tome & Principe
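The selection step described above can be sketched in Python. This is only an illustration under stated assumptions: the project used Random.org, whereas this sketch uses Python's pseudo-random generator, and the function name `pick_country_numbers` and the seed value are hypothetical.

```python
import random

def pick_country_numbers(n_countries=213, sample_size=20, seed=None):
    """Draw distinct country numbers from 1..n_countries without replacement."""
    rng = random.Random(seed)  # the seed only makes this sketch reproducible
    return sorted(rng.sample(range(1, n_countries + 1), sample_size))

# Hypothetical seed; the drawn numbers will not match the project's list.
picked = pick_country_numbers(seed=2013)
print(picked)
```

Sampling without replacement guarantees that no country number is drawn twice, which matches picking 20 distinct countries.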
- We picked the variables that we assumed would affect life expectancy the most.
Response Variable (Y): Life Expectancy at Birth, Male and Female
Life expectancy at birth indicates the number of years a newborn infant
would live if prevailing patterns of mortality at the time of its birth were to
stay the same throughout its life.
http://data.worldbank.org/indicator/SP.DYN.LE00.IN
X1: Access to improved sanitation facilities (measured as % of total population)
Percentage of the population using improved sanitation facilities, such as
flush/pour facilities.
http://data.worldbank.org/indicator/SH.STA.ACSN.UR/countries/1W?display=default
X2: Health expenditure per capita (measured in current US dollars)
Sum of public and private health expenditures as a ratio of total population.
http://data.worldbank.org/indicator/SH.XPD.PCAP
X3: Access to improved water source (measured as % of total population)
Percentage of the population using an improved drinking water source.
Improved water includes piped water, protected dug wells, and protected
springs.
http://data.worldbank.org/indicator/SH.H2O.SAFE.ZS
X4: Food Production Index
Food production index covers food crops that are considered edible and
that contain nutrients.
http://data.worldbank.org/indicator/AG.PRD.FOOD.XD
X5: Air Quality (CO2 emissions (kt))
Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of
cement.
http://data.worldbank.org/indicator/EN.ATM.CO2E.KT/countries
1 if greater than or equal to 62938.95
0 if less than 62938.95
- NOTE: These variables are not listed in order of importance. Because intuition is not a substitute
for an appropriate mathematical procedure, we are unable to determine which variables are useful
until we use stepwise regression.
- We started to enter the data for our variables. We ran into an issue with some countries not
having the most recent information, so in order to keep a sample size of 20, we disregarded
the countries that did not have complete data and picked replacement countries randomly.
1. In order to present a complete regression analysis, we are disregarding
countries without complete datasets.
- To clarify our variables, we write them as:
X1 = Sanitation Score
X2 = Healthcare
X3 = Water Quality
X4 = Food Production Index
X5 = Air Quality
1 if greater than or equal to 62938.95
0 if less than 62938.95
Model One: Linear First Order
We are using a basic model, just to see where we land as a starting point.
Ho: β1 = β2 = β3 = β4 = 0 (Model Isn't Useful)
Ha: At least one of β1, β2, β3, β4 ≠ 0 (Model Has Utility)
The fitted model is:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
Where
Y = Life Expectancy
β0 = Intercept
β1 = Coefficient for Sanitation Score
β2 = Coefficient for Healthcare
β3 = Coefficient for Water Quality (access to improved water source)
β4 = Coefficient for Food Production Index
The regression equation is
Life Expectancy = 36.2 + 0.233 Sanitation Score + 0.00139 Healthcare
                  + 0.0632 Water Quality + 0.0763 Food

Predictor           Coef        SE Coef     T      P
Constant            36.16       11.79       3.07   0.008
Sanitation Score    0.23250     0.05535     4.20   0.001
Healthcare          0.0013937   0.0006336   2.20   0.044
Water Quality       0.06319     0.08017     0.79   0.443
Food                0.07626     0.07634     1.00   0.334

S = 4.02304   R-Sq = 83.3%   R-Sq(adj) = 78.9%

Analysis of Variance
Source           DF   SS        MS       F       P
Regression       4    1212.43   303.11   18.73   0.000
Residual Error   15   242.77    16.18
Total            19   1455.20
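The summary statistics in this output follow directly from the Analysis of Variance table. The sketch below recomputes F, R-Sq, and R-Sq(adj) from the sums of squares reported above, using the standard formulas; the function name `anova_summary` is hypothetical and is not part of the Minitab output.

```python
def anova_summary(ss_regression, ss_error, n_obs, n_predictors):
    """Recompute F, R-squared, and adjusted R-squared from ANOVA sums of squares."""
    df_regression = n_predictors
    df_error = n_obs - n_predictors - 1
    ss_total = ss_regression + ss_error
    ms_regression = ss_regression / df_regression
    ms_error = ss_error / df_error
    return {
        "F": ms_regression / ms_error,
        "R2": ss_regression / ss_total,
        # Adjusted R-squared penalizes each additional predictor.
        "R2_adj": 1 - ms_error / (ss_total / (n_obs - 1)),
    }

# Sums of squares taken from the Model One ANOVA table (20 countries, 4 predictors).
model_one = anova_summary(1212.43, 242.77, n_obs=20, n_predictors=4)
print({name: round(value, 3) for name, value in model_one.items()})
```

Because the adjusted R-squared divides each sum of squares by its degrees of freedom, it can fall when predictors are added, which is why it is the figure used to compare the models in this project.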
Important observations:
1. Our overall model's p-value is 0.000.
2. Our adjusted R² (R²a) is 78.9%.
3. We reject H0, which states that our model isn't useful, and conclude that the model is useful in predicting life expectancy.
Model Two: Higher Order Model with Qualitative Variable
Here we believe that we can see some possible trends when we plot the points on a scatter plot,
thus we decided to incorporate higher-order terms.
Ho: β1 = β2 = β3 = β4 = β5 = β6 = β7 = β8 = 0 (Model Isn't Useful)
Ha: At least one of β1, ..., β8 ≠ 0 (Model Has Utility)
The fitted model is:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X3X4 + β6X5 + β7X3² + β8X2²
Where
Y = Life Expectancy
β0 = Intercept
β1 = Coefficient for Sanitation Score
β2 = Coefficient for Healthcare
β3 = Coefficient for Water Quality (access to improved water source)
β4 = Coefficient for Food Production Index
β5 = Coefficient for Interaction of Food Production Index and Water Quality
β6 = Coefficient for Air Quality
β7 = Coefficient for Water Quality Squared
β8 = Coefficient for Healthcare Squared
Note: we are using Air Quality as a qualitative variable.
We calculated the average CO2 emissions of the sample and decided to use it as the cutoff for a
qualitative variable. The average was 62938.95, thus any country with a value greater than or
equal to the average is represented by a '1', and any country with a value lower than 62938.95 is
represented by a '0'.
- 1 if X ≥ 62938.95
- 0 if X < 62938.95
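The coding rule above can be sketched as a small Python function. The threshold is the sample mean stated in the text; the function name `air_quality_dummy` and the example emission values are hypothetical and are not the project's data.

```python
CO2_THRESHOLD_KT = 62938.95  # sample mean of CO2 emissions (kt) reported in the text

def air_quality_dummy(co2_kt, threshold=CO2_THRESHOLD_KT):
    """Return 1 when emissions are at or above the threshold, otherwise 0."""
    return 1 if co2_kt >= threshold else 0

# Hypothetical emission values, chosen only to show both sides of the cutoff.
example_emissions = [1500.0, 62938.95, 250000.0]
codes = [air_quality_dummy(value) for value in example_emissions]
print(codes)  # [0, 1, 1]
```

Note that a value exactly equal to the threshold is coded as 1, matching the "greater than or equal to" rule.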
The regression equation is
Life Expectancy = 67.7 + 0.185 Sanitation Score + 0.00080 Healthcare
                  - 0.927 Water Quality + 0.078 Food + 0.00045 Food*Water
                  + 1.28 Air Quality + 0.00686 Waterquality ^2
                  + 0.000000 Healthcare ^2

Predictor           Coef         SE Coef      T       P
Constant            67.70        53.76        1.26    0.234
Sanitation Score    0.18515      0.06870      2.69    0.021
Healthcare          0.000800     0.004100     0.20    0.849
Water Quality       -0.9275      0.8215       -1.13   0.283
Food                0.0784       0.4459       0.18    0.864
Food*Water          0.000445     0.005229     0.09    0.934
Air Quality         1.279        4.078        0.31    0.760
Waterquality ^2     0.006860     0.004546     1.51    0.160
Healthcare ^2       0.00000000   0.00000063   0.00    0.998

S = 4.24554   R-Sq = 86.4%   R-Sq(adj) = 76.5%

Analysis of Variance
Source           DF   SS        MS       F      P
Regression       8    1256.93   157.12   8.72   0.001
Residual Error   11   198.27    18.02
Total            19   1455.20
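Each value in the T column is the coefficient divided by its standard error. The sketch below recomputes it for three rows of the Model Two table; the dictionary name `rows` is hypothetical. Because the printed coefficients are rounded, the last digit may differ slightly from the Minitab output.

```python
# (Coef, SE Coef) pairs copied from the Model Two predictor table.
rows = {
    "Sanitation Score": (0.18515, 0.06870),
    "Water Quality": (-0.9275, 0.8215),
    "Waterquality ^2": (0.006860, 0.004546),
}

# t statistic for testing whether an individual coefficient is zero.
t_stats = {name: coef / se for name, (coef, se) in rows.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
```

Only Sanitation Score has a t statistic large enough to give a p-value below 0.05 in this model, which is consistent with the P column above.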
Important observations:
1. Our overall model's p-value is 0.001, which is an increase from the first model's 0.000.
2. Our adjusted R² (R²a) is 76.5%, which is lower than our initial first-order linear model.
3. We reject H0, which states that our model isn't useful, and conclude that the model is useful in predicting life expectancy.
Model Three: Reduced Model with Qualitative Variable
We are using a nested (reduced) model, to try to be more straightforward, due to the fact that our
adjusted R² (R²a) has dropped.
Ho: β1 = β2 = β3 = β4 = β5 = β6 = 0 (Model Isn't Useful)
Ha: At least one of β1, ..., β6 ≠ 0 (Model Has Utility)
The fitted model is:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X3X4 + β6X5
The regression equation is
Life Expectancy = 49.0 + 0.228 Sanitation Score + 0.00131 Healthcare
                  - 0.082 Water Quality - 0.033 Food + 0.00128 Food*Water
                  + 0.47 Air Quality

Predictor           Coef        SE Coef     T       P
Constant            49.00       53.12       0.92    0.373
Sanitation Score    0.22756     0.06163     3.69    0.003
Healthcare          0.0013147   0.0009828   1.34    0.204
Water Quality       -0.0821     0.6107      -0.13   0.895
Food                -0.0325     0.4462      -0.07   0.943
Food*Water          0.001275    0.005276    0.24    0.813
Air Quality         0.470       3.673       0.13    0.900

S = 4.30674   R-Sq = 83.4%   R-Sq(adj) = 75.8%

Analysis of Variance
Source           DF   SS        MS       F       P
Regression       6    1214.08   202.35   10.91   0.000
Residual Error   13   241.12    18.55
Total            19   1455.20
Where
Y = Life Expectancy
β0 = Intercept
β1 = Coefficient for Sanitation Score
β2 = Coefficient for Healthcare
β3 = Coefficient for Water Quality (access to improved water source)
β4 = Coefficient for Food Production Index
β5 = Coefficient for Interaction of Food Production Index and Water Quality
β6 = Coefficient for Air Quality
Important observations:
1. Our overall model's p-value is 0.000.
2. Our adjusted R² (R²a) is 75.8%, which is lower than both of the previous models.
3. We reject H0, which states that our model isn't useful, and conclude that the model is useful in predicting life expectancy.
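Since Model Three is nested within Model Two, the two can also be compared with a nested-model (partial) F statistic. The project itself compares the models only by adjusted R², so this is a supplementary sketch rather than a step the authors report; the function name `nested_f` is hypothetical, and the inputs are the Residual Error rows of the two ANOVA tables.

```python
def nested_f(sse_reduced, sse_complete, df_error_reduced, df_error_complete):
    """Partial F statistic for the terms dropped from the complete model."""
    n_dropped = df_error_reduced - df_error_complete
    mse_complete = sse_complete / df_error_complete
    f_stat = ((sse_reduced - sse_complete) / n_dropped) / mse_complete
    return f_stat, n_dropped, df_error_complete

# Model Three (reduced): SSE = 241.12 on 13 df. Model Two (complete): SSE = 198.27 on 11 df.
f_stat, df_num, df_den = nested_f(241.12, 198.27, 13, 11)
print(f"F = {f_stat:.2f} on ({df_num}, {df_den}) degrees of freedom")
```

The statistic tests whether the two squared terms dropped from Model Two are jointly zero. A value this close to 1 would generally suggest they add little, though the formal decision requires comparing it with the critical value of the F distribution for those degrees of freedom.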
Conclusion:
In conclusion, we have developed an equation with a relatively high R² (83.3%, adjusted 78.9%). The first
model is our best model for determining an equation for life expectancy, since it has the highest adjusted R² of the three.