2. Focus: predicting the academic performance of an elementary school from the following attributes in the California Department of Education's API 2000 dataset:
Class size
Enrollment
Poverty
Parent education
Student performance
Teacher credentials
3. The project aims to construct a mathematical model, using multiple regression, that estimates academic performance from a set of predictor variables.
Analysis software used: SAS (Statistical Analysis System)
4. Variables Used for Analysis
We have 3 independent variables and 1 dependent variable. We screen variables based on:
Multicollinearity
Heteroscedasticity
Normality test
6. Equation for Multiple Regression
Y = 394.23 - 2.82 x1 + 4.21 x2 + 3.27 x3
where
x1 = not_hsg, i.e. parent not high school graduate
x2 = grad_sch, i.e. parent grad school
x3 = full, i.e. pct full credential
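As a minimal sketch, the fitted equation can be written as a small function. The coefficients below are the unrounded SAS parameter estimates reported for this model; the function name `predict_performance` is an illustrative assumption, not part of the original analysis.

```python
# Unrounded parameter estimates from the SAS output for this model.
INTERCEPT = 394.23899
B_NOT_HSG = -2.82726   # x1: parent not high school graduate
B_GRAD_SCH = 4.21761   # x2: parent grad school
B_FULL = 3.27664       # x3: pct full credential


def predict_performance(not_hsg, grad_sch, full):
    """Return the predicted academic performance Y (hypothetical helper)."""
    return INTERCEPT + B_NOT_HSG * not_hsg + B_GRAD_SCH * grad_sch + B_FULL * full


# Example inputs chosen for illustration: not_hsg = 33, grad_sch = 17, full = 100.
print(round(predict_performance(33, 17, 100), 2))
```

Using the unrounded estimates rather than the two-decimal values shown on the slide avoids small rounding drift in the predictions.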
7. Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3          5702793       1900931    317.51   <.0001
Error            396          2370879    5987.06896
Corrected Total  399          8073672

Root MSE         77.37615    R-Square   0.7063
Dependent Mean   647.6225    Adj R-Sq   0.7041
Coeff Var        11.94772

Parameter Estimates
Variable    Label                DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept   Intercept             1            394.23899         25.23765     15.62     <.0001              0
not_hsg     parent not hsg        1             -2.82726          0.21549    -13.12     <.0001        1.32287
grad_sch    parent grad school    1              4.21761          0.35989     11.72     <.0001        1.27024
full        pct full credential   1              3.27664          0.27704     11.83     <.0001        1.14321
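The summary statistics in this output follow from the sums of squares, degrees of freedom, estimates, and standard errors by standard formulas. The sketch below recomputes them as a consistency check; the variable names are illustrative assumptions, and small differences in the last digit can arise because the printed sums of squares are rounded.

```python
import math

# Values copied from the SAS Analysis of Variance table.
ss_model, df_model = 5702793, 3
ss_error, df_error = 2370879, 396
dependent_mean = 647.6225

ss_total = ss_model + ss_error          # corrected total sum of squares
df_total = df_model + df_error          # corrected total degrees of freedom
ms_model = ss_model / df_model          # mean square for the model
ms_error = ss_error / df_error          # mean square error
f_value = ms_model / ms_error           # overall F statistic
r_square = ss_model / ss_total          # share of variability explained
adj_r_square = 1 - (1 - r_square) * df_total / df_error
root_mse = math.sqrt(ms_error)
coeff_var = 100 * root_mse / dependent_mean

# t value = parameter estimate / standard error, per row of Parameter Estimates.
estimates = {
    "Intercept": (394.23899, 25.23765),
    "not_hsg": (-2.82726, 0.21549),
    "grad_sch": (4.21761, 0.35989),
    "full": (3.27664, 0.27704),
}
t_values = {name: est / se for name, (est, se) in estimates.items()}

print(round(f_value, 2), round(r_square, 4), round(adj_r_square, 4))
print(round(root_mse, 3), round(coeff_var, 3))
print({name: round(t, 2) for name, t in t_values.items()})
```

The recomputed R-square rounds to 0.7063, which agrees with the value printed in the SAS output.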
9. Main Points from SAS Output:
The F value is 317.51 and its p-value is < 0.0001, so the regression model is significant.
The p-values for the t-statistics of the selected variables are all < 0.0001, so all the variables are significant in the model.
The R-square is 0.7063, which means 70.63% of the total variability is explained by parent not high school graduate, parent grad school, and pct full credential.
10. Explanation of F-test:
The general equation for predicted y is
Y = b0 + b1*x1 + b2*x2 + b3*x3
When we remove an independent variable from the model, we are restricting its coefficient to be zero. The F-test asks whether all the slope coefficients could be zero at once:
H0: b1 = b2 = b3 = 0
H1: at least one bi is not equal to 0
We call this a test of overall model significance. If we fail to reject H0, we have no evidence that the model explains anything. If we reject H0, the model explains something.
Here P(F > 317.51) < 0.0001, so we reject H0: the model explains something.
11. Explanation of Beta Coefficients:
Y = 394.23 - 2.82 x1 + 4.21 x2 + 3.27 x3
implies
i) academic performance and not_hsg are inversely related, as shown by the coefficient -2.82, when the other variables are held fixed.
ii) academic performance increases as grad_sch increases, as shown by the positive coefficient 4.21, when the other variables are held fixed.
iii) similarly, the equation indicates a direct relationship between academic performance and full, when the other variables are held fixed.
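To make the "other variables held fixed" reading concrete: in a linear model, changing one predictor by some amount changes the predicted y by the coefficient times that amount. The sketch below applies this to a 10-unit change in each predictor; the 10-unit step and the helper name `change_in_prediction` are assumptions chosen for illustration.

```python
# Unrounded slope estimates from the SAS parameter estimates.
coefficients = {"not_hsg": -2.82726, "grad_sch": 4.21761, "full": 3.27664}


def change_in_prediction(variable, delta):
    """Change in predicted y when one predictor moves by delta, others fixed."""
    return coefficients[variable] * delta


for name in coefficients:
    # A 10-unit increase is an arbitrary illustrative step.
    print(name, round(change_in_prediction(name, 10), 2))
```

A negative result for not_hsg and positive results for grad_sch and full match the inverse and direct relationships described in points i) to iii).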
12. [Figure: performance vs not_hsg, with predicted y; x-axis: not_hsg, y-axis: performance]
[Figure: performance vs grad_sch, with predicted y; x-axis: grad_sch, y-axis: performance]
[Figure: performance vs full, with predicted y; x-axis: full, y-axis: performance]
13. Effect of each independent variable selected by the regression model:
Y = 394.23 - 2.82 x1 + 4.21 x2 + 3.27 x3
14. If we consider sets of 50 students at 11 different schools, with different educational backgrounds of parents, we need different percentages of fully credentialed teachers to achieve the same academic performance score.
School no.   not_hsg   grad_sch   full      y
1                  0         50    28.95    700
2                  5         45    39.70    700
3                 10         40    50.45    700
4                 15         35    61.20    700
5                 20         30    71.95    700
6                 25         25    82.70    700
7                 30         20    93.45    700
8                 35         15   100       686.21
9                 33         17   100       700.30
10                45          5   100       615.76
11                50          0   100       580.54
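The full column in this table can be reproduced by solving the fitted equation for full at a target score and capping the result at 100, since full is a percentage. The sketch below assumes, as the table does, that not_hsg and grad_sch sum to 50; the helper names are illustrative, and the table appears to truncate rather than round the second decimal.

```python
# Unrounded SAS parameter estimates.
INTERCEPT, B_NOT_HSG, B_GRAD_SCH, B_FULL = 394.23899, -2.82726, 4.21761, 3.27664


def predict(not_hsg, grad_sch, full):
    """Predicted academic performance from the fitted equation."""
    return INTERCEPT + B_NOT_HSG * not_hsg + B_GRAD_SCH * grad_sch + B_FULL * full


def required_full(not_hsg, grad_sch, target=700.0):
    """Solve the fitted equation for full, capped at 100 (a percentage)."""
    needed = (target - INTERCEPT - B_NOT_HSG * not_hsg - B_GRAD_SCH * grad_sch) / B_FULL
    return min(needed, 100.0)


# Assumption taken from the table: not_hsg + grad_sch = 50 in every row.
for not_hsg in (0, 5, 10, 15, 20, 25, 30, 35):
    grad_sch = 50 - not_hsg
    full = required_full(not_hsg, grad_sch)
    print(not_hsg, grad_sch, round(full, 2), round(predict(not_hsg, grad_sch, full), 2))
```

Once the required value exceeds 100, the cap binds and the predicted score falls below the target, which is what rows 8, 10, and 11 of the table show.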
15. In brief, to achieve a score of 700 with fully credentialed teachers (full = 100), at most 33 of the students can have parents who are not high school graduates. If the number of such students decreases, the target is easier to reach.
On the other hand, the predicted value of y is maximized if the parents of every student are in the graduate school category and full = 100.
If the number of parents who are not high school graduates increases by 10% of the total number of students and the percentage of fully credentialed teachers increases by 10%, the following graph shows the change in the predicted value.
[Figure: predicted y & full vs. not_hsg; x-axis: not_hsg; series: full and y; data labels at not_hsg = 45: full = 100, y = 615.76434]
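The limit of 33 students can be checked directly from the fitted equation. The sketch below assumes, as in the table of 11 schools, a group of 50 students in which every parent is either not_hsg or grad_sch, and it uses full = 100; the function names are hypothetical.

```python
import math

# Unrounded SAS parameter estimates.
INTERCEPT, B_NOT_HSG, B_GRAD_SCH, B_FULL = 394.23899, -2.82726, 4.21761, 3.27664


def best_case_prediction(group_size=50, full=100.0):
    """Predicted y when all parents are grad_sch (not_hsg = 0)."""
    return INTERCEPT + B_GRAD_SCH * group_size + B_FULL * full


def max_not_hsg(target=700.0, group_size=50, full=100.0):
    """Largest not_hsg count that still reaches the target score.

    Each student moved from grad_sch to not_hsg lowers the prediction by
    (B_GRAD_SCH - B_NOT_HSG), so divide the available margin by that amount.
    """
    margin = best_case_prediction(group_size, full) - target
    return math.floor(margin / (B_GRAD_SCH - B_NOT_HSG))


print(round(best_case_prediction(), 2))
print(max_not_hsg())
```

Under these assumptions the function returns 33, in line with the statement above, and `best_case_prediction` gives the maximized predicted value for the all-grad_sch, full = 100 case.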
16. [Figure: cumulative distribution of prediction error %; x-axis: prediction error %; series: ecdf]
The prediction error % is computed as abs(actual - predicted) * 100 / actual. The chart shows that 75% of cases have < 15% error and 87.5% have < 22% error.
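A minimal sketch of the error formula and of reading one point off an empirical cumulative distribution follows. The (actual, predicted) pairs are hypothetical values made up for illustration, not records from the API 2000 data, and the helper names are assumptions.

```python
def percent_error(actual, predicted):
    """Absolute prediction error as a percentage of the actual value."""
    return abs(actual - predicted) * 100 / actual


def ecdf_at(errors, threshold):
    """Fraction of cases whose error is strictly below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)


# Hypothetical (actual, predicted) pairs, for illustration only.
pairs = [(700, 686), (600, 650), (800, 790), (500, 620)]
errors = [percent_error(actual, predicted) for actual, predicted in pairs]

print([round(e, 2) for e in errors])
print(ecdf_at(errors, 15))
```

Applied to the real actual and predicted scores, `ecdf_at(errors, 15)` and `ecdf_at(errors, 22)` would correspond to the 75% and 87.5% figures quoted from the chart.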
17. Conclusion
We are able to predict academic performance.
We have a good R-square of 0.7063, i.e. 70.63% of the variability is explained by the model.
We are also able to interpret the estimates of the model.