
# Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

## by apanugan on Apr 06, 2011


Identified the most crucial factors that influence the Human Development Index (HDI) through regression analysis using Minitab software.

### Accessibility

Uploaded via SlideShare as Microsoft PowerPoint

## Factors influencing the Human Development Index (HDI) using Multiple Linear Regression: Presentation Transcript

• Factors influencing the Human Development Index (HDI) using Multiple linear regression
1202062944
Industrial Engineering
Year of data: 2008
Source: UN Development Programme Database
• Objective and Dataset description
To find which of the following variables have an effect on the Human Development Index (HDI)
• Fitting the full model without interaction terms
The regression equation for full model is
y = 0.0596 + 0.00440 LIF + 0.000007 GDP - 0.000748 GRO + 0.0158 SCH + 0.0080 GEN + 0.0159 EXP - 0.000004 GNI + 0.000003 MAT - 0.000051 HOM - 0.000540 MOR + 0.000176 LIT - 0.0185 DEP + 0.0023 CON1 - 0.0117 CON2 - 0.0100 CON3 + 0.00431 CON4 - 0.0268 CON5
The coefficients of the above regression equation are difficult to interpret because the predictors are measured on very different scales.
Hence the regression coefficients were standardized using unit normal scaling.
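Unit normal scaling centers each numeric predictor on its mean and divides by its sample standard deviation, so each standardized coefficient equals the raw coefficient multiplied by that predictor's standard deviation. The sketch below illustrates this in Python with NumPy on made-up data (the slides used Minitab). The function names `unit_normal_scale` and `ols` and every number are assumptions for illustration, and only the predictors are scaled, not the response.

```python
import numpy as np

def unit_normal_scale(X):
    """Center each column on its mean and divide by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std

def ols(X, y):
    """Least-squares fit with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Made-up data: two predictors on very different scales (not the HDI dataset)
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(70, 9, 50), rng.normal(15000, 12000, 50)])
y = 0.06 + 0.004 * X[:, 0] + 0.000007 * X[:, 1] + rng.normal(0, 0.01, 50)

Z, mean, std = unit_normal_scale(X)
b_raw = ols(X, y)   # coefficients in the original units
b_std = ols(Z, y)   # coefficients per one standard deviation of each predictor

print("raw coefficients:         ", b_raw[1:])
print("standardized coefficients:", b_std[1:])
print("raw * std:                ", b_raw[1:] * std)
```

After scaling, the coefficients are directly comparable: each one is the change in y for a one-standard-deviation change in its predictor.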
• Fitting the full model after Standardization
The regression equation is
y = 0.684 + 0.0404 LIF + 0.100 GDP - 0.0117 GRO + 0.0408 SCH + 0.00136 GEN + 0.0443 EXP - 0.0627 GNI + 0.00089 MAT - 0.00068 HOM - 0.0196 MOR + 0.00259 LIT - 0.0185 DEP + 0.0023 CON1 - 0.0117 CON2 - 0.0100 CON3 + 0.00431 CON4 - 0.0268 CON5
Model Statistics:
R-Sq = 98.5% R-Sq(adj) = 98.2%
Analysis of Variance (ANOVA)

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 17 | 2.21784 | 0.13046 | 325.49 | 0.000 |
| Residual Error | 84 | 0.03367 | 0.00040 | | |
| Total | 101 | 2.25150 | | | |

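The model statistics can be cross-checked from the ANOVA table above: MS = SS / DF, F = MS(regression) / MS(error), R-sq = SS(regression) / SS(total), and R-sq(adj) = 1 - MS(error) / (SS(total) / DF(total)). A short Python check using only the figures printed in the table (variable names are illustrative):

```python
# Values copied from the ANOVA table in the slides
ss_reg, df_reg = 2.21784, 17
ss_err, df_err = 0.03367, 84
ss_tot, df_tot = 2.25150, 101

ms_reg = ss_reg / df_reg                      # mean square, regression
ms_err = ss_err / df_err                      # mean square, residual error
f_stat = ms_reg / ms_err                      # overall F statistic
r_sq = ss_reg / ss_tot                        # coefficient of determination
r_sq_adj = 1 - ms_err / (ss_tot / df_tot)     # adjusted for degrees of freedom

print(f"MS(reg) = {ms_reg:.5f}, MS(err) = {ms_err:.5f}")
print(f"F = {f_stat:.2f}")
print(f"R-sq = {100 * r_sq:.1f}%, R-sq(adj) = {100 * r_sq_adj:.1f}%")
```

The results agree with the reported values up to the rounding of the printed sums of squares.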
• Signs of Multicollinearity
Inference from Variance Inflation Factor (VIFs):
VIF of GDP = 560.116 and VIF of GNI = 533.109 (far above 10, indicating severe multicollinearity)
VIF of EXP = 18.368 and VIF of GRO = 16.456 (above the usual cutoff of 10, indicating multicollinearity)
Inference from Correlation matrix:
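
| | LIF | GDP | GRO | SCH | GEN | EXP |
|---|---|---|---|---|---|---|
| GDP | 0.595 | | | | | |
| GRO | 0.719 | 0.630 | | | | |
| SCH | 0.603 | 0.553 | 0.776 | | | |
| GEN | -0.677 | -0.705 | -0.758 | -0.743 | | |
| EXP | 0.692 | 0.636 | 0.956 | 0.774 | -0.798 | |
| GNI | 0.584 | 0.999 | 0.618 | 0.539 | -0.688 | 0.620 |
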
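The two diagnostics on this slide are related: the VIF of predictor j is the j-th diagonal element of the inverse of the predictor correlation matrix, which equals 1 / (1 - R²j) when predictor j is regressed on the others. The sketch below shows this in Python with NumPy on made-up data. The function name `vif` and the synthetic columns are assumptions, not the HDI dataset or Minitab's output.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(R))

# Made-up data: columns 0 and 1 are nearly identical (much as GDP and GNI are
# in the slides); column 2 is unrelated to both.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
X = np.column_stack([a, a + 0.03 * rng.normal(size=100), rng.normal(size=100)])

vifs = vif(X)
for name, value in zip(["x0", "x1", "x2"], vifs):
    flag = "multicollinearity" if value > 10 else "ok"
    print(f"{name}: VIF = {value:.1f} ({flag})")
```

A pairwise correlation close to 1, such as the 0.999 between GDP and GNI, is enough to push both VIFs far above 10.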
• Dropped GNI from the model because it is almost perfectly correlated with GDP (r = 0.999).
• No change in the R-sq and R-sq(adj) statistics before and after dropping GNI:
R-Sq = 98.5% R-Sq(adj) = 98.2%
To confirm the multicollinearity between EXP and GRO, carried out a further analysis using Principal Component Analysis.
Condition number = λmax / λmin = 7.8001 / 0.0327 = 238.53 (> 100, indicating moderate multicollinearity)
• Also dropped EXP from the model and checked the model summary statistics: a slight reduction in R-sq and R-sq(adj).
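The condition number used on this slide is the ratio of the largest to the smallest eigenvalue of the predictor correlation matrix. A minimal Python sketch with NumPy follows. The function name `condition_number` and the synthetic data are assumptions for illustration; only the two eigenvalues in the arithmetic check come from the slides.

```python
import numpy as np

def condition_number(X):
    """Ratio of the largest to the smallest eigenvalue of the correlation matrix."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigvals = np.linalg.eigvalsh(R)   # symmetric matrix; eigenvalues ascending
    return eigvals[-1] / eigvals[0]

# Arithmetic check of the eigenvalues reported in the slides
kappa_reported = 7.8001 / 0.0327
print(f"condition number from reported eigenvalues: {kappa_reported:.2f}")

# Made-up two-predictor example: with correlation r the eigenvalues are
# 1 + |r| and 1 - |r|, so the condition number is (1 + |r|) / (1 - |r|).
rng = np.random.default_rng(5)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.3 * rng.normal(size=200)])
r = np.corrcoef(X, rowvar=False)[0, 1]
kappa = condition_number(X)
print(f"r = {r:.3f}, condition number = {kappa:.1f}")
```

The reported value of 238.53 is reproduced up to rounding of the eigenvalues.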
• Residual plots and Model Adequacy
Both the normal probability plot and the residuals-vs-fitted plot look good, satisfying the normality and constant-variance assumptions.
• Indicator Interactions
Considered interaction terms of DEP and other numerical variables.
24 variables in all including all the interaction terms
S = 0.0220704 R-Sq = 98.3% R-Sq(adj) = 97.8%; R-Sq(pred) = 96.80%
Residual plots:
• Outliers and Influential points
• Other outliers in graph
Refit the model for each of the data points 45, 50 and 80 and checked for any change in the summary statistics.
These points have neither high leverage nor influence; they are only outliers. R-sq also does not change much, so we are leaving them in the model.
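The slides do not say which diagnostics were used to judge leverage and influence. Two common ones are the hat values h_ii (leverage, often flagged above 2p/n) and Cook's distance (influence, often flagged above 1). The sketch below computes both with NumPy on made-up data with one planted point. The function name `influence_diagnostics`, the data, and the cutoffs are assumptions, not the authors' procedure.

```python
import numpy as np

def influence_diagnostics(X, y):
    """Return hat values, residuals, and Cook's distances for an OLS fit."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape
    H = A @ np.linalg.pinv(A)            # hat matrix A (A'A)^-1 A'
    h = np.diag(H)                       # leverage of each observation
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    cooks_d = resid**2 / (p * mse) * h / (1.0 - h) ** 2
    return h, resid, cooks_d

# Made-up data with one planted point that is extreme in x and off the line
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1 + 2 * x + rng.normal(0, 0.1, 30)
x[0], y[0] = 8.0, 12.0                   # the line would give 1 + 2*8 = 17

h, resid, cooks_d = influence_diagnostics(x.reshape(-1, 1), y)
n, p = 30, 2
high_leverage = np.where(h > 2 * p / n)[0]
influential = np.where(cooks_d > 1)[0]
print("high-leverage points:", high_leverage)
print("influential points:  ", influential)
```

A point can be an outlier in y without appearing in either list, which is the situation described above for data points 45, 50 and 80.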
• Residual plots after taking off the outliers and influential points
• The normal probability plot looks good, but the residuals-vs-fits plot has a double-bow shape.
• To confirm this, we used the Box-Cox transformation, which showed that a transformation of y is needed.
• Box-Cox Transformation
Suggests λ = 2, which implies the transformation y → y²
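A common way to choose λ for a regression model is to apply the power transformation scaled by the geometric mean of y for a grid of λ values, refit the model each time, and pick the λ with the smallest residual sum of squares. The sketch below does this with NumPy on made-up data in which y² is linear in x, so a value near 2 should be recovered. The function name `boxcox_sse`, the grid, and the data are assumptions; Minitab's exact procedure may differ.

```python
import numpy as np

def boxcox_sse(x, y, lam):
    """Residual sum of squares after the geometric-mean-scaled power transform."""
    gm = np.exp(np.mean(np.log(y)))              # geometric mean of y (y > 0)
    if abs(lam) < 1e-12:
        y_t = gm * np.log(y)                     # limiting case lambda = 0
    else:
        y_t = (y**lam - 1.0) / (lam * gm ** (lam - 1.0))
    A = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(A, y_t, rcond=None)
    resid = y_t - A @ coef
    return float(resid @ resid)

# Made-up data: y squared is linear in x plus a little noise
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 60)
y = np.sqrt(5 + 3 * x + rng.normal(0, 0.1, 60))

lambdas = np.arange(-2.0, 3.5, 0.5)              # grid from -2 to 3 in steps of 0.5
sse = [boxcox_sse(x, y, lam) for lam in lambdas]
best_lambda = float(lambdas[int(np.argmin(sse))])
print("lambda with the smallest SSE:", best_lambda)
```

Scaling by the geometric mean keeps the residual sums of squares comparable across different values of λ.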
• Residual plots after transformation
Some outliers are visible in the normal probability plot.
• Outliers and Influential points
• Residual plots after taking off the outliers and influential points
No need for any transformation, Box-Cox suggests λ = 1
• Variable selection and Model building
• Fit the selected model
Regression equation:
y² = 0.476 - 0.0164 GEN + 0.0403 GRO + 0.0422 LIF + 0.0557 GDP + 0.0449 SCH - 0.0181 CON2 - 0.0388 MOR + 0.0523 GDP_D + 0.0289 CON5 + 0.0412 MOR_D - 0.0476 HOM_D
Detected multicollinearity using Principal Component Analysis:
condition number = 134.837 (> 100, moderate multicollinearity)
Linear dependency: 0.107 GRO + 0.337 LIF + 0.798 MOR - 0.467 MOR_D (the variables involved in the dependency)
The correlation matrix showed that MOR is highly correlated with LIF and MOR_D.
Dropping MOR removed the multicollinearity from the model: condition number = 39.04617 (< 100, no multicollinearity)
• Residual plots after dropping MOR
• Presence of an outlier: data point 72
• No need for any transformation, Box-Cox suggests λ = 1
• Fit the model after dropping off the outlier
The regression equation is
y² = 0.482 - 0.0221 GEN + 0.0436 GRO + 0.0576 LIF + 0.0528 GDP + 0.0483 SCH - 0.0115 CON2 + 0.0556 GDP_D + 0.0182 CON5 + 0.0169 MOR_D - 0.0538 HOM_D
R-sq = 99.1% R-sq(adj) = 99% R-sq(pred) = 98.73%
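R-sq(pred) is conventionally computed from the PRESS statistic, the sum of squared leave-one-out prediction errors: R-sq(pred) = 1 - PRESS / SS(total). PRESS needs no refitting because each leave-one-out error equals e_i / (1 - h_ii). The sketch below shows this with NumPy on made-up data; the function name `press_r_sq` and the data are assumptions for illustration.

```python
import numpy as np

def press_r_sq(X, y):
    """Return (R-sq, R-sq(pred), PRESS) for an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                     # hat matrix
    resid = y - H @ y
    # Leave-one-out prediction errors without refitting: e_i / (1 - h_ii)
    press = float(np.sum((resid / (1.0 - np.diag(H))) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    r_sq = 1.0 - float(resid @ resid) / sst
    return r_sq, 1.0 - press / sst, press

# Made-up data (not the HDI dataset)
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = 0.5 + X @ np.array([0.05, -0.02, 0.04]) + rng.normal(0, 0.01, 40)

r_sq, r_sq_pred, press = press_r_sq(X, y)
print(f"R-sq = {100 * r_sq:.2f}%, R-sq(pred) = {100 * r_sq_pred:.2f}%")
```

R-sq(pred) never exceeds R-sq; a small gap between the two, as in the figures above, suggests the model is not overfitted.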
• Model validation
Considered 118 countries for modelling
102 countries for estimation and 16 countries for prediction
• Conclusion
The reduced model has a better R-sq than the original model, and most of the variables in the model are significant (low p-values).
The following variables were found to be significant
Gender inequality index
Combined gross enrolment
Life expectancy at birth
GDP
Mean schooling years
Countries in continent 2
GDP & intensity of deprivation
Under 5 mortality rate & intensity of deprivation
Homicide rate & intensity of deprivation
• Possible improvements
More datapoints
Ridge regression to reduce the effect of multicollinearity
Robust regression: to down-weight outlying data points while retaining them in the model.
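As a rough illustration of the ridge regression suggestion, the sketch below solves (Z'Z + kI) b = Z'y on standardized predictors with NumPy. The function name `ridge`, the made-up collinear data, and the arbitrary choice k = 1 are all assumptions; in practice k would be chosen from a ridge trace or by cross-validation.

```python
import numpy as np

def ridge(X, y, k):
    """Ridge coefficients on unit-normal-scaled predictors and a centered response."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yc = y - y.mean()
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc)

# Made-up data: two almost identical predictors
rng = np.random.default_rng(6)
a = rng.normal(size=80)
X = np.column_stack([a, a + 0.05 * rng.normal(size=80)])
y = 0.7 + 0.05 * a + rng.normal(0, 0.02, 80)

b_ols = ridge(X, y, k=0.0)     # k = 0 reduces to ordinary least squares
b_ridge = ridge(X, y, k=1.0)   # k > 0 shrinks the coefficients
print("OLS coefficients:  ", b_ols)
print("ridge coefficients:", b_ridge)
```

Ridge regression keeps all predictors and trades a small bias for more stable coefficients, so it reduces the effect of multicollinearity without removing it.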