Average performance prediction of elementary school using multiple regression

Submitted By:
Ankur Khandelwal
Anurag Shandilya
Pullahbhatla Apuroop
Srikanth Mallya

Agenda
1. Introduction
2. Business Objective
3. Factors and Their Influence
4. Final Regression Model and Variables
5. Inferences drawn from Analysis
6. Appendix

Introduction
The dataset contains the following:
• Performance of 400 elementary schools from the California
Department of Education, along with factors such as class size, parent
education, student performance, etc.

Business Objective
• To find the factors that have a major influence on academic
performance.
• To predict the academic performance of a school using those
factors.

Factors & their Influence
Factors chosen on the basis of statistical significance:
Factor                                          Impact
English language learners (ell)                 Negative
Percentage first year in school (mobility)      Negative
Average class size K-3 (acs_k3)                 Positive
Parent college grad (col_grad)                  Positive
Parent grad school (grad_sch)                   Positive
Percentage emergency credential (emer)          Negative

Regression Model & Variables
Regression Equation
API100 = 563.44 - 2.78*(ell) - 2.47*(mobility) + 10.63*(acs_k3)
         + 1.20*(col_grad) + 2.85*(grad_sch) - 2.79*(emer)
Variable    Label                        Parameter
Intercept   Intercept                    563.4395
ell         english language learners    -2.77842
mobility    pct 1st year in school       -2.46762
acs_k3      avg class size k-3           10.6312
col_grad    parent college grad          1.20298
grad_sch    parent grad school           2.8468
emer        pct emer credential          -2.7947
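
As a rough illustration, the fitted equation can be evaluated directly. The sketch below is written in Python rather than the SAS used for the analysis. The function name predict_api and the sample school values are hypothetical; only the coefficients come from the parameter table.

```python
# Coefficients copied from the parameter table of the final model.
INTERCEPT = 563.4395
COEFFICIENTS = {
    "ell": -2.77842,       # english language learners
    "mobility": -2.46762,  # pct 1st year in school
    "acs_k3": 10.6312,     # avg class size k-3
    "col_grad": 1.20298,   # parent college grad
    "grad_sch": 2.8468,    # parent grad school
    "emer": -2.7947,       # pct emer credential
}


def predict_api(school):
    """Return the predicted API for a dict holding one value per predictor."""
    return INTERCEPT + sum(
        coefficient * school[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical school, used only to show how the function is called.
example_school = {
    "ell": 20, "mobility": 15, "acs_k3": 19,
    "col_grad": 20, "grad_sch": 10, "emer": 10,
}
print(round(predict_api(example_school), 1))
```

Keeping the coefficients in a dictionary keyed by variable name makes it harder to pair a value with the wrong coefficient.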

Inferences Drawn from Analysis
• ell: A higher percentage of English language learners means more
students are still weak in English, which can affect their performance
in other subjects as well and may lower the school's API.
• mobility: A higher mobility value means a larger share of students
are in their first year at the school. Schools with a high mobility
rate show a lower API value.
• grad_sch: Students whose parents attended graduate school are
likely to get guidance from their parents. This can strongly
influence their performance in school.

Inferences Drawn from Analysis (contd.)
• emer: Teachers working on emergency credentials can strongly
influence the API value because weak students may not have “anytime
access” to fully qualified teachers. A higher percentage of such
teachers is associated with lower school performance.
• acs_k3: In this model, a larger average class size in grades K-3 is
associated with higher performance, so acs_k3 makes a positive
contribution to the average performance index of the schools.
• col_grad: Students whose parents are college graduates are likely
to get guidance from their parents. This can strongly influence their
performance in school. The higher the share of college-graduate
parents, the higher the performance of the students in the schools.

Appendix
• Missing Values and outliers Treatment
• Test For Regression
• Check for Multicollinearity
• Check for Significance of Individual Parameters
• Check for Heteroscedasticity
• Check for Normality
• Mean Absolute Percentage Error
• Check for R-Square Value

Missing Value and Outlier Treatment
[Figure panels: Before the treatment; After the treatment]

Test for Regression
Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               6           6282718        1047120     229.78    <.0001
Error             393           1790954         4557.1
Corrected Total   399           8073672
• This is done to check the overall significance of the model:
• H0: The independent variables collectively have no influence on the dependent
variable (all slope coefficients are zero).
• H1: At least one independent variable influences the dependent variable.
• If P-value > α: H0 can't be rejected, and hence the model has no demonstrated
explanatory power.
• If P-value < α: H0 is rejected, and hence at least one independent variable
influences the dependent variable.
• In this case the P-value (<.0001) is below α, and hence at least one independent
variable influences the dependent variable.
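
The F value in the table can be reproduced from the sums of squares and degrees of freedom. This is a small Python check (not the SAS output itself); the variable names are illustrative.

```python
# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_model, df_model = 6282718, 6
ss_error, df_error = 1790954, 393

# A mean square is a sum of squares divided by its degrees of freedom.
ms_model = ss_model / df_model
ms_error = ss_error / df_error

# The overall F statistic is the ratio of the two mean squares.
f_value = ms_model / ms_error

print(round(ms_model), round(ms_error, 1), round(f_value, 2))
# prints: 1047120 4557.1 229.78
```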

Check for Multicollinearity
Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept   Intercept                     1            563.43951         49.84285      11.3     <.0001                    0
ell         English language learners     1             -2.77842          0.17562    -15.82     <.0001              1.66602
mobility    pct 1st year in school        1             -2.46762          0.47464      -5.2     <.0001              1.10217
acs_k3      avg class size k-3            1              10.6312          2.50946      4.24     <.0001              1.02771
col_grad    parent college grad           1              1.20298          0.24159      4.98     <.0001               1.3863
grad_sch    parent grad school            1               2.8468          0.34247      8.31     <.0001              1.51113
emer        pct emer credential           1              -2.7947          0.33022     -8.46     <.0001              1.31735
• Multicollinearity arises when the independent variables are highly interdependent.
• In that case the individual impact of each variable on the dependent variable can't be
correctly estimated.
• The extent of multicollinearity is captured by the variance inflation factor (VIF).
• The final model must keep only variables whose VIF does not exceed a threshold of
roughly 1.5 to 2; here all six VIFs lie between 1.03 and 1.67.
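
For reference, the VIF of predictor j is defined as 1 / (1 - R_j^2), where R_j^2 is the R-square obtained by regressing that predictor on all the other predictors. The Python sketch below inverts this to show how much of each predictor is explained by the others. The helper name implied_r_squared is hypothetical; the VIF values are taken from the table.

```python
# VIF values reported in the Parameter Estimates table.
vif = {
    "ell": 1.66602,
    "mobility": 1.10217,
    "acs_k3": 1.02771,
    "col_grad": 1.3863,
    "grad_sch": 1.51113,
    "emer": 1.31735,
}


def implied_r_squared(vif_value):
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary R^2."""
    return 1.0 - 1.0 / vif_value


for name, value in vif.items():
    print(f"{name}: VIF={value:.3f}, auxiliary R^2={implied_r_squared(value):.3f}")
```

Even the largest VIF (ell) corresponds to an auxiliary R-square of about 0.40, so less than half of that predictor's variation is explained by the other five.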

Check for Significance of Individual Parameters
Parameter Estimates
Variable    Label                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept   Intercept                     1            563.43951         49.84285      11.3     <.0001                    0
ell         English language learners     1             -2.77842          0.17562    -15.82     <.0001              1.66602
mobility    pct 1st year in school        1             -2.46762          0.47464      -5.2     <.0001              1.10217
acs_k3      avg class size k-3            1              10.6312          2.50946      4.24     <.0001              1.02771
col_grad    parent college grad           1              1.20298          0.24159      4.98     <.0001               1.3863
grad_sch    parent grad school            1               2.8468          0.34247      8.31     <.0001              1.51113
emer        pct emer credential           1              -2.7947          0.33022     -8.46     <.0001              1.31735
• The P values of the variables are checked for significance.
• Variables having P value > α are not significant and are not important for the model.
• The final model must have variables having P value < α and a VIF that does not exceed
the threshold of roughly 1.5 to 2.
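
Each t value in the table is the parameter estimate divided by its standard error. The following Python sketch (illustrative names, values copied from the table) recomputes them as a consistency check.

```python
# (estimate, standard error) pairs from the Parameter Estimates table.
estimates = {
    "Intercept": (563.43951, 49.84285),
    "ell": (-2.77842, 0.17562),
    "mobility": (-2.46762, 0.47464),
    "acs_k3": (10.6312, 2.50946),
    "col_grad": (1.20298, 0.24159),
    "grad_sch": (2.8468, 0.34247),
    "emer": (-2.7947, 0.33022),
}

# t statistic for H0: coefficient = 0 is estimate / standard error.
t_values = {name: est / se for name, (est, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

The larger the absolute t value, the smaller the P value; ell has the largest magnitude and acs_k3 the smallest among the predictors.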

Check for Heteroscedasticity
Test of First and Second Moment Specification
DF    Chi-Square    Pr > ChiSq
27         53.67        0.0017
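• Heteroscedasticity occurs when the variance of the random error component is not
constant.
• White's test is used to check for heteroscedasticity.
• Null hypothesis: the model is homoscedastic.
• If P value > α: H0 can't be rejected, and hence the model is homoscedastic, and
vice versa.
• Here the P value is 0.0017, so at the conventional α = 0.05 H0 is rejected, which
points to some heteroscedasticity in the residuals.
• The SPEC option of the MODEL statement in PROC REG is used to check for
heteroscedasticity.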
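
The 27 degrees of freedom can be sanity-checked by counting terms. White's test relates the squared residuals to the predictors, their squares, and their pairwise cross-products. Assuming none of these terms is redundant, six predictors give 6 + 6 + 15 = 27 terms. The Python sketch below does the count; it does not run the test itself.

```python
from itertools import combinations_with_replacement

predictors = ["ell", "mobility", "acs_k3", "col_grad", "grad_sch", "emer"]

# First-order terms: the predictors themselves.
linear_terms = list(predictors)

# Second-order terms: every square and every pairwise cross-product.
second_order_terms = list(combinations_with_replacement(predictors, 2))

white_df = len(linear_terms) + len(second_order_terms)
print(len(linear_terms), len(second_order_terms), white_df)
# prints: 6 21 27
```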

Check for Normality
• Once the model contains only the significant variables, an output file is created.
• The output file contains the predicted values and the residuals.
• The residuals saved in the output file are tested for normality.
• This is done using PROC UNIVARIATE with the NORMAL option.

Mean Absolute Percentage Error
The Means Procedure
Analysis Variable : ERROR
Mean
8.7668826
• Mean absolute percentage error (MAPE) captures the overall percentage
error of the model.
• Ideally MAPE should be within 10%; here it is about 8.77%.
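
A minimal Python sketch of the MAPE calculation, assuming the ERROR variable holds 100 * |actual - predicted| / actual for each school. The function name mape and the sample values are hypothetical and are not taken from the dataset.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero, since each error is divided by it.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    percentage_errors = [
        abs((a - p) / a) * 100.0 for a, p in zip(actual, predicted)
    ]
    return sum(percentage_errors) / len(percentage_errors)


# Hypothetical API values for three schools, for illustration only.
actual_api = [600, 700, 800]
predicted_api = [630, 665, 800]
print(round(mape(actual_api, predicted_api), 2))
```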

Check for R-Square Value
Root MSE          67.507    R-Square    0.7782
Dependent Mean    647.62    Adj R-Sq    0.7748
Coeff Var         10.424
• R-square captures the proportion of variation in the dependent variable that is
explained by the linear regression.
• The higher the value of R-square, the better the explanatory power.
• It acts as a measure of goodness of fit of the model.
• R-square should be at least 65% (0.65); here it is 0.7782.
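
These fit statistics follow from the Analysis of Variance table. The Python sketch below (illustrative variable names) recomputes them from the sums of squares, the 400 observations, and the 6 predictors.

```python
import math

# Values from the Analysis of Variance table.
ss_model = 6282718
ss_error = 1790954
ss_total = 8073672
n_obs = 400          # number of schools
n_predictors = 6     # variables in the final model
dependent_mean = 647.62

# R-square: share of total variation explained by the model.
r_square = ss_model / ss_total

# Adjusted R-square penalises the number of predictors.
df_error = n_obs - n_predictors - 1
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_error

# Root MSE and coefficient of variation (Root MSE as % of the mean).
root_mse = math.sqrt(ss_error / df_error)
coeff_var = 100 * root_mse / dependent_mean

print(round(r_square, 4), round(adj_r_square, 4),
      round(root_mse, 3), round(coeff_var, 3))
# prints: 0.7782 0.7748 67.507 10.424
```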
