15th November, 2017
BANA7041 HW: STATISTICAL
METHODS - MODULE II
SECTION 001
PART A WRITTEN/ PART B COMPUTATIONAL
ASSIGNED GROUP NUMBER: 2
GROUP MEMBER NAMES:
Datta, Sourapratim (M12399768)
Kalra, Ravish (M12382149)
Popuri, Venkata Sai Lakshmi Srikanth (M12388241)
Alumni Donation Case Study
INTRODUCTION
This study uses a constituent database of universities to identify the strongest predictors of the
alumni giving rate at private and public universities in the United States. The analysis was done
largely in R and SAS. After several revisions of the model, our best result achieves an R² of 0.719,
compared with 0.575 for the baseline model.
DATASET OVERVIEW
The Alumni dataset describes the Alumni Giving Rate of universities in the US. It contains a list of 48
observations in total. Each observation contains the following variables:
• School: The name of the institution.
• Student/Faculty Ratio: The number of students attending the institution divided by the number
of faculty at the institution.
• % of Classes Under 20: The percentage of classes with fewer than 20 students.
• Private: A categorical variable indicating whether the institution is privately owned or not.
• Alumni Giving Rate: The percentage of alumni who give back to the institution.
Additional data was collected to improve the prediction rate of the model from the site:
vault.hanover.edu/~dodge/Statistics/DownloadData/Alumni%20Giving.xls
The additional predictor variable used for the model is Graduation Rate - The average graduation
rate of students for the institution in the given period.
Distribution of the variables:
Figures 1 and 2 show the distributions of the variables used in the model.
Figure 1 - Distribution of variables
Figure 2- Distribution of variables
Relation between the variables
Figure 3 shows pairwise scatter plots of the variables.
Figure 3 - Scatter plot of the variables
                        Graduation Rate   % of Classes Under 20   Student/Faculty Ratio   Alumni Giving Rate
Graduation Rate               1.0000000               0.5827884              -0.6049379            0.7559436
% of Classes Under 20         0.5827884               1.0000000              -0.7855593            0.6456504
Student/Faculty Ratio        -0.6049379              -0.7855593               1.0000000           -0.7423975
Alumni Giving Rate            0.7559436               0.6456504              -0.7423975            1.0000000
Table 1 - Correlation between the variables
The scatter plots suggest a positive linear relationship between Graduation Rate and Alumni Giving
Rate. Table 1 confirms this: the correlation between the two variables is 0.756, the strongest
correlation of any predictor with the response.
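As a rough arithmetic check on Table 1, squaring each predictor's correlation with Alumni Giving Rate gives the share of its variation that a one-predictor linear fit would explain. The sketch below uses only the values in Table 1; the dictionary and variable names are illustrative and not part of the original analysis.

```python
# Correlation of each predictor with Alumni Giving Rate, copied from Table 1.
corr_with_giving = {
    "Graduation Rate": 0.7559436,
    "% of Classes Under 20": 0.6456504,
    "Student/Faculty Ratio": -0.7423975,
}

# In simple linear regression, R^2 equals the squared Pearson correlation.
r_squared = {name: r ** 2 for name, r in corr_with_giving.items()}

# List the predictors from strongest to weakest single-predictor fit.
for name, value in sorted(r_squared.items(), key=lambda kv: -kv[1]):
    print(f"{name}: r^2 = {value:.3f}")
```

On its own, Graduation Rate would account for about 57% of the variation in Alumni Giving Rate.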
FITTING LINEAR REGRESSION MODEL
Base Model
A multiple linear regression was fitted to the Alumni Donation data, with the Alumni Giving Rate
as the response variable (Y).
Three predictor variables were used in the model:
• Percentage of classes with fewer than 20 students (X1)
• Student/faculty Ratio (X2)
• Private (X3)
The initial estimated model is given as
𝒀̂ = 36.784 + 0.077X1 – 1.398X2 + 6.285X3
Improved model
To test whether Graduation Rate (X4) improves on the base model, we performed a partial F-test
comparing the two models. The test gives an F-statistic of 113.2 with a p-value of 2.666e-05. At a
significance level of 0.05, we reject the null hypothesis H0: β4 = 0, where β4 is the coefficient of
X4, and conclude that Graduation Rate contributes significantly to the model.
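For a single added predictor, the partial F-statistic can be recovered from the R² values of the two models. The sketch below assumes the standard formula F = ((R²_full - R²_reduced) / q) / ((1 - R²_full) / (n - p)), with q added predictors and p parameters in the full model including the intercept; the function name and arguments are illustrative. Plugging in the rounded R² values reported in the Conclusion gives F of roughly 22.1 on (1, 43) degrees of freedom, not the 113.2 quoted above. The two figures may come from different columns of the ANOVA output, so this is a cross-check on the arithmetic and not a correction.

```python
def partial_f(r2_reduced, r2_full, n, p_full, q):
    """Partial F-statistic for adding q predictors to a nested model.

    n is the sample size and p_full is the number of parameters
    (including the intercept) in the full model.
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1.0 - r2_full) / (n - p_full)
    return numerator / denominator


# R^2 values as reported for the base and improved models; n = 48 schools.
f_stat = partial_f(r2_reduced=0.5747, r2_full=0.7191, n=48, p_full=5, q=1)
print(f"F = {f_stat:.1f} on (1, {48 - 5}) degrees of freedom")
```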
The improved model is thus given by:
𝒀̂ = -25.549 – 0.0809X1 – 0.815X2 + 7.555X3 + 0.7652X4
This model suggests the following, in each case holding the other predictors constant:
• With every one-point increase in the percentage of classes with under 20 students, the
predicted Alumni Giving Rate decreases by 0.0809 percentage points.
• With every unit increase in the Student/Faculty Ratio, the predicted Alumni Giving Rate
decreases by 0.815 percentage points.
• Private institutions (X3 = 1) have a predicted Alumni Giving Rate 7.555 percentage points higher
than public institutions with the same values of the other predictors.
• With every one-point increase in Graduation Rate, the predicted Alumni Giving Rate increases by
0.7652 percentage points.
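The interpretations above can be checked by evaluating the fitted equation directly. The sketch below hard-codes the improved model's coefficients; the function name and the input values (50% small classes, a student/faculty ratio of 12, an 85% graduation rate) are hypothetical and chosen only for illustration.

```python
def predict_giving_rate(pct_classes_under_20, student_faculty_ratio,
                        private, graduation_rate):
    """Predicted Alumni Giving Rate (%) from the improved model.

    private is 1 for a private institution and 0 for a public one.
    """
    return (-25.549
            - 0.0809 * pct_classes_under_20
            - 0.815 * student_faculty_ratio
            + 7.555 * private
            + 0.7652 * graduation_rate)


# Hypothetical school profile, evaluated as private and as public.
private_pred = predict_giving_rate(50, 12, 1, 85)
public_pred = predict_giving_rate(50, 12, 0, 85)

print(f"private: {private_pred:.3f}, public: {public_pred:.3f}")
print(f"difference: {private_pred - public_pred:.3f} percentage points")
```

The difference between the two predictions equals the coefficient of X3 whatever values the other predictors take, which is why it reads as an additive shift in percentage points.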
MODEL DIAGNOSTICS
Diagnostics on the residuals of the improved model, based on the plots in Figures 4 and 5 and the
influence measures in Table 2, lead to the following conclusions:
• Apart from points 21 and 33 in the residuals vs. fitted plot (Figure 4, top left), the residuals
appear to have constant variance. No quadratic pattern is visible, indicating that the
linearity of the regression function holds.
• Normal Q-Q plot (Figure 4, top right): the points deviate from the reference line at both
tails, and point 21 lies notably above it. Overall, however, the residuals appear approximately
normally distributed, and more so than the residuals from the model in the previous
homework.
• A check for outliers suggests that point 21, with a residual of 21.88, is an outlier. Refer
to Figure 5.
• Points 9, 21, 26 and 43 were found to be influential. Refer to Table 2.
• A point is categorized as a high leverage point if the hat value h_ii associated with the
observation is greater than 2p/n, where p is the number of parameters and n is the sample
size. For this model, p = 5 and n = 48, so 2p/n is equal to 0.208. Based on this threshold,
two observations in the data are high leverage points: observations 1 and 43.
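The leverage rule in the last bullet is simple enough to sketch in a few lines. The example below computes the 2p/n cutoff for this model and applies it to made-up hat values; the labels "A", "B" and "C" and their values are hypothetical, since the full set of hat values is not listed in this report.

```python
def leverage_threshold(p, n):
    """Rule-of-thumb cutoff 2p/n for flagging high leverage points."""
    return 2.0 * p / n


def high_leverage(hat_values, p, n):
    """Return the labels whose hat value exceeds 2p/n."""
    cutoff = leverage_threshold(p, n)
    return [label for label, h in hat_values.items() if h > cutoff]


# Improved model: 5 parameters (intercept + 4 predictors), 48 observations.
cutoff = leverage_threshold(p=5, n=48)
print(f"cutoff = {cutoff:.3f}")

# Hypothetical hat values, for illustration only.
example_hats = {"A": 0.05, "B": 0.21, "C": 0.30}
flagged = high_leverage(example_hats, p=5, n=48)
print("flagged:", flagged)
```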
Figure 4- Model Diagnostics
Figure 5- Residual Boxplot
Influence measures of
lm(formula = y ~ x1 + x2 + x3 + x4) :
     dfb.1_  dfb.x1  dfb.x2  dfb.x3  dfb.x4  dffit  cov.r    cook.d   hat  inf
9      0.01   -0.01   -0.01   -0.01   -0.01  -0.01   1.35  4.12e-05  0.17   *
21    -0.05   -0.50   -0.03   -0.35    0.39   1.03   0.36  1.71e-01  0.08   *
26     0.04   -0.03   -0.07    0.06   -0.01  -0.10   1.37  2.00e-03  0.19   *
43    -0.22    0.21    0.07    0.14    0.12  -0.26   1.41  1.42e-02  0.20   *
Table 2- Influence and Leverage Points
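To see which measure is responsible for each asterisk in Table 2, the rounded values can be compared against cutoffs. The sketch below assumes the default cutoffs used by R's influence.measures, as we understand them: |DFBETAS| > 1, |DFFIT| > 3·sqrt(k/(n-k)), |1 - cov.r| > 3k/(n-k) and hat > 3k/n, with k = 5 parameters and n = 48. It reads the nine numeric values in each row as five DFBETAS (intercept first), then DFFIT, cov.r, Cook's distance and the hat value. The Cook's distance check is omitted because it needs the F distribution, which the Python standard library does not provide.

```python
import math

K, N = 5, 48  # parameters (including intercept) and sample size

# Rounded values copied from Table 2: (dfbetas, dffit, cov.r, hat).
rows = {
    9:  ([0.01, -0.01, -0.01, -0.01, -0.01], -0.01, 1.35, 0.17),
    21: ([-0.05, -0.50, -0.03, -0.35, 0.39], 1.03, 0.36, 0.08),
    26: ([0.04, -0.03, -0.07, 0.06, -0.01], -0.10, 1.37, 0.19),
    43: ([-0.22, 0.21, 0.07, 0.14, 0.12], -0.26, 1.41, 0.20),
}


def flag_reasons(dfbetas, dffit, cov_r, hat, k=K, n=N):
    """List the measures that exceed their assumed cutoff."""
    reasons = []
    if any(abs(d) > 1 for d in dfbetas):
        reasons.append("dfbetas")
    if abs(dffit) > 3 * math.sqrt(k / (n - k)):
        reasons.append("dffit")
    if abs(1 - cov_r) > 3 * k / (n - k):
        reasons.append("cov.r")
    if hat > 3 * k / n:
        reasons.append("hat")
    return reasons


flags = {obs: flag_reasons(*values) for obs, values in rows.items()}
for obs, reasons in flags.items():
    print(obs, reasons)
```

Under these assumed cutoffs, every listed observation is flagged by its covariance ratio, and observation 21 additionally by DFFIT.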
CONCLUSION
Evaluation Metrics
The R² value, a common measure of regression fit that gives the proportion of the variation in
the response explained by the model, was used to evaluate the models.
Result
Two models with different sets of predictor variables were compared on their R² values.
• Base Model: With three predictor variables, this model gave a Multiple R² of 0.5747 and an
Adjusted R² of 0.5457.
• Improved Model: With Graduation Rate as an additional predictor, this model gave a Multiple R²
of 0.7191 and an Adjusted R² of 0.693. Adding Graduation Rate to the linear regression model
therefore improves both the Multiple and the Adjusted R².
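The Adjusted R² values follow from the Multiple R² values, the sample size and the number of predictors. The sketch below uses the standard formula, Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of predictors; the function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Multiple R^2 values as reported; n = 48 schools.
base_adj = adjusted_r2(0.5747, n=48, k=3)
improved_adj = adjusted_r2(0.7191, n=48, k=4)

print(f"base model:     {base_adj:.4f}")
print(f"improved model: {improved_adj:.4f}")
```

Because the adjustment penalizes the extra predictor, the rise in Adjusted R² indicates that Graduation Rate adds explanatory power beyond what an arbitrary additional variable would.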