This project was part of our coursework in Applied Regression Analysis.
In this project, our aim was to find the relationship between one dependent (response) variable and four independent (predictor) variables.
To understand how followers grow on Twitter, we took the number of followers as our response variable, and the number of tweets posted, the number of years since the person joined Twitter, the number of photos and videos posted, and the number of people the person follows back as our predictor variables, and performed a multiple linear regression analysis.
Project Proposal
Problem statement:
In this multiple linear regression project, we are trying to determine the relationship between one
response variable (Y) and four predictor variables (X1, X2, X3, X4). We want to determine how the
values of the predictor variables affect the value of the response variable, and which predictor
variable shows the strongest linear relationship with the response.
The variables:
The variables are as follows:
Response variable (Y): number of followers of a person on Twitter (in millions)
Predictor variable (X1): number of tweets posted by that person
Predictor variable (X2): number of years since that person joined Twitter
Predictor variable (X3): number of photos and videos posted
Predictor variable (X4): number of people that person follows back
The data collection process:
We used the official Forbes list to find the 100 most-followed people on Twitter
(http://www.forbes.com/sites/maddieberg/2015/06/29/twitters-most-followed-celebrities-retweets-dont-
always-mean-dollars/#35671f137ef3)
We also used Twitter itself (www.twitter.com) to collect the data for the response variable and each predictor variable.
We searched Twitter for these people's accounts and verified each as the person's official account by looking for the verified-account badge, the symbol Twitter uses for official accounts. We used the 40 most-followed people among them.
Why would modeling this data set be meaningful?
Social media strongly affects people's lives in today's world and is constantly evolving as new sites appear; it has revolutionized how we look at things. While some people have made their way to success through hard work and years of experience, others have made it through social media, which also involves the same amount of hard work. According to ebizmba.com (http://www.ebizmba.com/articles/social-networking-websites), Twitter is today the second most popular social media site. It is worth understanding which factors drive a person's popularity, so we are going to determine which of these four predictor variables affects the popularity of a person the most.
Scatterplot Matrix and Pairwise Correlation Coefficients
The scatterplot matrix helps us understand the relationship between the response variable, No of followers (Y), and the predictor variables: No of tweets (X1), Years since joining (X2), No of photos uploaded (X3), and Following back (X4). It also shows the scatterplots between pairs of predictors: X1 vs X2, X1 vs X3, X1 vs X4, X2 vs X3, X2 vs X4, and X3 vs X4.
The correlation coefficient matrix contains the correlations between the response (Y) and each predictor (X), and between every pair of predictors. Each coefficient r lies between -1 and +1. For a response-predictor correlation, a magnitude greater than 0.7 is considered high, around 0.5 moderate, and below 0.3 low. A high correlation (|r| >= 0.7) between the response and a predictor is desirable.
A large correlation coefficient between two predictors indicates multicollinearity. If the correlation between two predictors is positive, high values of one predictor occur with high values of the other, and low values with low values; if it is negative, high values of one predictor occur with low values of the other. A high correlation between two predictors is undesirable because it indicates severe multicollinearity, which inflates the variances of the coefficient estimates and makes the estimates sensitive to changes in the model. We therefore want the correlations between predictors to be near zero.
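The pairwise correlations discussed above can be reproduced in a few lines. The sketch below uses pandas with a small made-up data set (the report's 39-observation data set is not reproduced here); the column names are our own stand-ins for Y and X1 through X4.

```python
import pandas as pd

# Made-up placeholder rows; the real data set has 39 observations.
data = pd.DataFrame({
    "followers": [81.0, 64.9, 59.8, 55.4, 51.0, 12.3],   # Y, millions
    "tweets":    [3100, 9400, 27000, 5200, 12000, 800],  # X1
    "years":     [7, 6, 8, 5, 9, 3],                     # X2
    "photos":    [320, 1100, 2400, 610, 1500, 90],       # X3
    "following": [210, 120000, 560, 410, 250000, 50],    # X4
})

# Pearson correlation matrix: Y vs each X in the first row/column,
# X vs X correlations elsewhere.
corr = data.corr()
print(corr.round(3))
```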
Below is the discussion from the scatterplot and correlation matrix.
Y vs X plots
No of followers vs No of tweets (Y vs X1): positive upward trend; r = 0.14782, a low correlation.
No of followers vs Years since joining (Y vs X2): positive upward trend; r = 0.51656, a moderate correlation.
No of followers vs No of photos uploaded (Y vs X3): positive trend; r = 0.32294, a low correlation.
No of followers vs Following back (Y vs X4): positive upward trend; r = 0.47663, a moderate correlation.
X vs X plots
No of tweets vs Years since joining (X1 vs X2): positive upward trend; r = 0.14618, near zero. These two predictors are only weakly correlated, which is good.
No of tweets vs No of photos uploaded (X1 vs X3): strong positive upward trend; r = 0.77547. These two predictors are highly correlated, which is bad.
No of tweets vs Following back (X1 vs X4): positive upward trend; r = 0.18112, near zero. Weakly correlated, which is good.
Years since joining vs No of photos uploaded (X2 vs X3): positive upward trend; r = 0.20020, near zero. Weakly correlated, which is good.
Years since joining vs Following back (X2 vs X4): positive upward trend; r = 0.57997. Moderately correlated, which is bad.
No of photos vs Following back (X3 vs X4): positive upward trend; r = 0.23780, near zero. Weakly correlated, which is good.
Overall, we can say there is a multicollinearity problem.
Potential complication
There is severe multicollinearity between X1 and X3 (r = 0.77547).
II PRELIMINARY MULTIPLE LINEAR REGRESSION MODEL
The general linear regression model is
Yi = β0 + β1Xi1 + β2Xi2 + ... + βp-1Xi,p-1 + εi
where β0, β1, ..., βp-1 are parameters, Xi1, ..., Xi,p-1 are known constants, the εi are independent N(0, σ²), and i = 1, ..., n.
The fitted model is
Ŷ (No of followers) = -25.40410 - 0.00044178·(No of tweets) + 8.15496·(Years since joining) + 0.00877·(No of photos uploaded) + 0.00003188·(Following back)
Model assumptions
For model adequacy, the following assumptions need to be satisfied:
a) The current MLR model form is reasonable.
b) The residuals have constant variance.
c) The residuals are normally distributed (desired, though not required).
d) The residuals are uncorrelated.
e) There are no outliers.
f) The predictors are not highly correlated with each other.
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi
where Yi is the No of followers; β0, β1, β2, β3, β4 are the parameters; Xi1: No of tweets; Xi2: Years since joining; Xi3: No of photos uploaded; Xi4: Following back.
From the plots of residuals against each predictor (No of tweets, Years since joining, No of photos uploaded, Following back), there is no curvature in any graph, which shows that the MLR model form is reasonable.
b) The residuals have constant variance.
Residuals vs Ŷ: this plot helps us see whether the residuals have constant variance. The graph shows a funnel shape, so the residuals have non-constant variance.
Modified Levene Test: Test for Constancy of Error Variance.
F test
H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.0006, which is less than 0.05, so we reject H0 (a strong conclusion) and conclude that the error variance is not constant.
Two-sample t test
Since the variance is not constant, we read the unequal-variance row.
H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.0482, which is less than 0.05, so we reject H0 and conclude that the error variance is not constant.
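The modified Levene (Brown-Forsythe) test used above splits the residuals into low-fitted and high-fitted groups and compares their spread around the group medians. A sketch with synthetic residuals (scipy's `levene` with `center='median'` is the Brown-Forsythe variant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
fitted = rng.uniform(0, 80, 39)
resid = rng.normal(0.0, 1.0 + fitted / 40.0)  # spread grows with fitted value

# Split at the median fitted value, then test equality of group variances.
low = resid[fitted <= np.median(fitted)]
high = resid[fitted > np.median(fitted)]
stat, p = stats.levene(low, high, center="median")
print(round(stat, 4), round(p, 4))  # small p suggests non-constant variance
```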
c) The residuals are normally distributed
Normal probability plot: this plot of the residuals against their expected normal scores helps us see whether the residuals are normally distributed. The graph shows that the residuals are right-skewed, so the residuals are not normal.
Test for Normality
H0: normality is okay; H1: normality is violated.
Decision rule: reject H0 if ρ̂ < c(α, n), with α = 0.10.
From the critical values table, c(α, n) = 0.977, and from the output ρ̂ = 0.92338, which is less than 0.977. So we reject H0 (a strong conclusion) and conclude that normality is violated.
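The normality test above is the correlation test: the ordered residuals are correlated with their expected normal scores, and the coefficient ρ̂ is compared with the tabled critical value c(α, n). A sketch using scipy's `probplot`, which returns exactly this correlation (residuals are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
resid = rng.normal(size=39)  # stand-in for the model residuals

# probplot pairs the ordered residuals with their expected normal scores
# and reports the correlation r of that pairing.
(osm, osr), (slope, intercept, rho_hat) = stats.probplot(resid, dist="norm")
print(round(rho_hat, 5))  # reject normality when rho_hat < c(alpha, n)
```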
d) The residuals are uncorrelated
Data were not collected in time order, so time plot is not relevant.
e) Outliers: There are no outliers
f) The predictors are not highly correlated with each other.
Variance Inflation Factor: the VIF is a method of detecting multicollinearity among the predictors. It measures how much the variances of the estimated regression coefficients are inflated compared to when the predictors are not linearly related, and it is found by regressing Xk on the other p - 2 predictors. The formula is
(VIF)k = 1 / (1 - Rk²)
where Rk² is the coefficient of multiple determination from that regression.
Guideline: if the mean VIF is much greater than 1, or any (VIF)k is much greater than 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. Here (VIF)1 = 2.50905, (VIF)2 = 1.51656, (VIF)3 = 2.58140, (VIF)4 = 1.54274.
The coefficient variances for No of tweets, Years since joining, No of photos uploaded, and Following back are inflated 2.51, 1.52, 2.58, and 1.54 times, respectively, compared to when the predictors are not linearly related. None of the predictors has a VIF above 5, and the mean VIF of 2.037 is not much bigger than 1, so we conclude that serious multicollinearity is not a problem.
Transformation
The residuals showed non-constant variance and non-normality, so to satisfy the model assumptions we need to perform a transformation. Our main goal is to remove the non-constant variance, since no transformation is guaranteed to fix normality. We tried the variance-stabilizing transformations from weakest to strongest: the weakest is the square-root transformation Y' = √Y (λ = 0.5) and the strongest is Y' = -1/Y (λ = -1). The square-root transformation did not satisfy our model assumptions, so we moved to the log transformation (λ = 0).
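The transformation ladder described above, from weakest to strongest, is easy to apply directly (the follower counts below are made-up):

```python
import numpy as np

y = np.array([81.0, 64.9, 59.8, 55.4, 51.0, 12.3, 9.8])  # followers, millions

# Variance-stabilizing ladder, weakest to strongest:
y_sqrt  = np.sqrt(y)     # lambda = 0.5
y_log   = np.log10(y)    # lambda = 0   (the transform kept in the report)
y_recip = -1.0 / y       # lambda = -1
print(y_log.round(4))
```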
a) The current MLR model form is reasonable
log10 Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi
where log10 Yi is the log-transformed No of followers; β0, β1, β2, β3, β4 are the parameters; Xi1: No of tweets; Xi2: Years since joining; Xi3: No of photos uploaded; Xi4: Following back.
From the plots of residuals against each predictor (No of tweets, Years since joining, No of photos uploaded, Following back), there is no curvature in any of the graphs, so the MLR model form is reasonable.
b) The residuals have constant variance
Residuals vs log10 Ŷ: the graph shows no funnel shape, so the variance is constant.
Modified Levene test
F test: H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.1910, which is greater than 0.05, so we fail to reject H0 and conclude that the variance is constant. Since the variance is constant, we read the equal-variance row of the t test.
Two-sample t test: H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.0936, which is greater than 0.05, so we fail to reject H0 and conclude that the variance is constant.
c) Normality plot and normality test
Normality test
α = 0.05; H0: normality is okay; H1: normality is violated.
Decision rule: reject H0 if ρ̂ < c(α, n).
From the critical values table, c(α, n) = 0.972, and from the output ρ̂ = 0.974, which is greater than 0.972. So we fail to reject H0 (a weak conclusion) and conclude that normality is okay. At α = 0.10 the test would fail, but in multiple linear regression the normality assumption is desired rather than required, so we can move ahead with this result. The important assumption is constant variance, which we have achieved with the log transformation, so we stopped the transformations at the log.
d) Data were not collected in time order, so time plot is not relevant.
e) Diagnostic
Outlying X observation
The hat matrix is helpful in identifying X outliers. Its diagonal elements hii lie between 0 and 1, and their sum is p, the number of parameters. hii measures the distance between the X values of the ith case and the mean of the X values of all n cases; in this context the diagonal element is called the leverage of the ith case. If hii is large, observation i is considered an X outlier with high leverage. A leverage value is considered large when hii > 2p/n.
2p/n = (2 × 5)/39 = 0.2564. Comparing the leverages against 0.2564, observations 3, 9, 11, 15, 31, and 32 exceed it, so they are X outliers.
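The leverage values come from the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ. A sketch with a synthetic design matrix (n = 39, p = 5 as above); the diagonal summing to p is a quick sanity check:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 39, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal h_ii are the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
cutoff = 2 * p / n                      # 2p/n = 0.2564 here
x_outliers = np.where(h > cutoff)[0]    # flagged high-leverage cases
print(round(h.sum(), 6), round(cutoff, 4))
```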
Outlying Y observations
We flag as Y outliers the observations whose studentized deleted residuals are large in absolute value. The Rstudent column contains the studentized deleted residuals; to judge the largest absolute value |ti|, we perform the Bonferroni outlier test.
Bonferroni outlier test
α = 0.10, n = 39, p = 5
Bonferroni critical value: |ti| > t(1 - α/(2n), n - p - 1) = t(1 - 0.10/(2 × 39), 33) = t(0.99872, 33) = 3.25817
To find Y outliers, we look for observations whose |ti| exceeds 3.25817. By comparison, there are none, so there are no Y outliers.
From the two results above, observations 3, 9, 11, 15, 31, and 32 are X outliers.
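The Bonferroni critical value above is just a t quantile, so it can be recomputed in one line:

```python
from scipy import stats

alpha, n, p = 0.10, 39, 5
# t(1 - alpha/(2n), n - p - 1) for the studentized deleted residuals.
crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)
print(round(crit, 5))
```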
Influence
After finding the X outliers, the next step is to check whether they are influential. To check their influence, we use three influence measures:
1) DFFITS 2) DFBETAS 3) Cook's distance
1) DFFITS: measures the influence of an observation on its fitted value Ŷ (No of followers).
Guideline: |DFFITS| > 2√(p/n) is considered influential; here 2√(5/39) = 0.71611.
Observation No   Type of Outlier   DFFITS
3                X                 -1.3807
9                X                 -0.3078
11               X                 -0.6020
15               X                 -0.0438
31               X                 -0.6945
32               X                  0.0242
Remarks: only observation 3 exceeds 0.71611 in absolute value, so this observation has influence on the fitted value of No of followers.
2) DFBETAS: measures the influence of the X outliers on the regression coefficients of the intercept, No of tweets, Years since joining, No of photos uploaded, and Following back. We look at the absolute value of DFBETAS; a large absolute value is considered influential.
Guideline: |DFBETAS| > 2/√n is considered influential; here 2/√39 = 0.3202.
Observation No   Type   Intercept    X1        X2        X3        X4
3                X       0.0599     0.1398   -0.0490   -0.0977   -1.0565
9                X      -0.0173     0.0315    0.0061    0.0439   -0.2427
11               X       0.0519     0.2885   -0.0207   -0.5411    0.1529
15               X       0.0005    -0.0241    0.0021   -0.0033    0.0029
31               X       0.0824    -0.5061   -0.0539    0.1173    0.1629
32               X       0.0802    -0.0093   -0.0203    0.0124    0.0083
Remarks: outliers 3 and 31 influence the Following back and No of tweets coefficients, respectively.
3) Cook's distance: measures the influence of the X outliers on all n fitted values. It is an aggregate influence measure of the ith case on all n fitted values, denoted Di.
Guideline: if Di > F(0.50, p, n - p), observation i is said to be influential; here F(0.50, 5, 34) = 0.8878.
Observation No   Type   Cook's distance
3                X      0.21800
9                X      0.06657
11               X      0.08801
15               X      0.00886
31               X      0.08420
32               X      0.00127
Remarks: none of the observations exceeds the F value, so none is influential on all n fitted values of No of followers.
f) Variance Inflation Factor
Guideline: if the mean VIF is much greater than 1, or any (VIF)k is much greater than 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5.
(VIF)1 = 2.50905, (VIF)2 = 1.51656, (VIF)3 = 2.58140, (VIF)4 = 1.54274.
None of the predictors has a VIF above 5, and the mean VIF of 2.037 is not much bigger than 1, so we conclude that serious multicollinearity is not a problem.
The preliminary model is
log10(No of followers) = 0.87514 - 0.00000570·(No of tweets) + 0.08402·(Years since joining) + 0.00011227·(No of photos uploaded) + 3.081619E-7·(Following back)
The parameter estimates are b0 = 0.87514, b1 = -0.00000570, b2 = 0.08402, b3 = 0.00011227, b4 = 3.081619E-7.
Holding the other predictors constant: log10(No of followers) decreases by 0.00000570 when the number of tweets increases by one; increases by 0.08402 when Years since joining increases by one year; increases by 0.00011227 when one more photo is uploaded; and increases by 3.08E-7 when one more account is followed back.
There are 39 observations, so the degrees of freedom for the corrected total is n - 1 = 38. We have 4 predictors, so the model has 4 degrees of freedom, and the error has 39 - 4 - 1 = 34.
Standard error: the standard errors of the regression coefficients; they are used to construct confidence intervals.
Sum of squares: SST (total sum of squares) = SSM (model sum of squares) + SSE (error sum of squares). SST is the total variation in the response. SSE is the unexplained variation in No of followers, based on yi - ŷi; this variation is deviation from the model. SSM is the explained variation, based on ŷi - ȳ; this is the variation due to the model.
Mean square: the ratio of a sum of squares to its corresponding degrees of freedom. The mean square error is an estimate of the variance σ² of our model; here MSE = 0.02251.
Root MSE: 0.15004, the value of s, the estimate of the parameter σ of our model.
Dependent mean: 1.51431, the mean of log10 Y.
F value: the test statistic for checking whether the regression is significant.
H0: β1 = β2 = β3 = β4 = 0
H1: not all of β1, β2, β3, β4 are 0
Decision rule: reject H0 if F* > F(1 - α, p - 1, n - p), with α = 0.05.
F* = 5.41 and F(0.95, 4, 34) ≈ 2.65. Since 5.41 > 2.65, we reject H0 and conclude that not all of β1, β2, β3, β4 are zero; in other words, the regression is significant.
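The overall F test can be checked by recomputing the critical value (the F* = 5.41 below is taken from the report's ANOVA table):

```python
from scipy import stats

alpha, p, n = 0.05, 5, 39
f_star = 5.41                                   # model F statistic from the report
f_crit = stats.f.ppf(1 - alpha, p - 1, n - p)   # F(0.95, 4, 34)
print(round(f_crit, 4), f_star > f_crit)        # regression is significant
```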
Coefficient of multiple determination (R²): from the table, R² = 0.3891, the fraction of the variability in No of followers explained by No of tweets, Years since joining, No of photos uploaded, and Following back.
Significance of predictors
t value: the t statistic, the ratio of a parameter estimate to its standard error. The null hypothesis is that the regression coefficient is zero; if a predictor's coefficient is zero, it does not contribute significantly to the model. A candidate for dropping is the predictor with the smallest |t*|.
The Pr > |t| values are two-sided; examining each t statistic and its corresponding p value tells us which predictors are significant. The p values for No of tweets (X1) and Following back (X4) are greater than α = 0.05, so these two predictors are not significant for predicting No of followers. The p value for Years since joining (X2) is approximately equal to 0.05, and that for No of photos uploaded (X3) is less than 0.05, so these two predictors are significant for predicting No of followers.
III Exploration of Interaction Terms using Partial Regression Plot
A partial regression plot is also called an added-variable plot. It helps us understand the marginal role of a candidate interaction term given that the other predictor variables are already in the model, so from these plots we can see which interaction would help predict No of followers (Y). To select an interaction, we look for a trend, positive or negative; if there is no trend, we should not add the interaction.
The following interaction terms are possible:
1) X1X2: No of tweets × Years since joining        2) X1X3: No of tweets × No of photos uploaded
3) X1X4: No of tweets × Following back             4) X2X3: Years since joining × No of photos uploaded
5) X2X4: Years since joining × Following back      6) X3X4: No of photos uploaded × Following back
Here we regressed the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X1X2 given X1, X2, X3, X4. The plot shows a slight negative trend, but the points form what looks like a horizontal band, so this interaction does not contain additional information useful for predicting No of followers (Y).
Here we regressed the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X1X3 given X1, X2, X3, X4. The plot shows a slight negative trend, but the points form what looks like a horizontal band, so this interaction does not contain additional information useful for predicting No of followers (Y).
Here we regressed the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X1X4 given X1, X2, X3, X4. The plot shows a positive trend, so this interaction may be helpful in predicting No of followers.
Here we regressed the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X2X3 given X1, X2, X3, X4. The plot shows no trend, so this interaction will not provide any additional information for predicting No of followers.
Here we regressed the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X2X4 given X1, X2, X3, X4. The plot shows a negative trend, so this interaction may be helpful in predicting No of followers.
Here we regressed the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X3X4 given X1, X2, X3, X4. The plot shows a positive trend, so this interaction may be useful in predicting No of followers.
From the interaction plots above, the X1X2 and X1X3 plots show only slight negative trends, so these terms would not be helpful, and the X2X3 plot shows no trend, so there is no need to add it to the model. Among X1X4, X2X4, and X3X4, the X2X4 plot shows the clearest trend around the regression line, so X2X4 is the one interaction term worth adding.
Correlation involving the added interaction term, before and after standardization. Standardization is important for models containing interaction and polynomial terms, because standardizing the predictors helps reduce multicollinearity. A standardized variable is obtained by centering to mean zero and scaling to variance one; centering the predictors is especially important for interaction terms.
Before Standardization
After Standardization
The results above show the effect of standardization: it helps reduce multicollinearity. In the correlation matrix after standardization, the correlations between the interaction term Years since joining × Following back and each of No of tweets, Years since joining, No of photos, and Following back have decreased.
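Standardizing before forming the interaction can be done by hand; the sketch below (synthetic data) shows the correlation between a predictor and its interaction term dropping once the predictors are centered and scaled:

```python
import numpy as np

rng = np.random.default_rng(7)
x2 = rng.normal(5, 2, size=39)          # years since joining (made-up)
x4 = rng.normal(1e4, 5e3, size=39)      # following back (made-up)

# Center to mean 0, scale to variance 1, then form the interaction.
z2 = (x2 - x2.mean()) / x2.std(ddof=1)
z4 = (x4 - x4.mean()) / x4.std(ddof=1)
r_raw = np.corrcoef(x2, x2 * x4)[0, 1]  # before standardization
r_std = np.corrcoef(z2, z2 * z4)[0, 1]  # after standardization
print(round(r_raw, 3), round(r_std, 3))
```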
IV. Model Search
For 4 predictor variables, the number of parameters is p = 5 and the total number of possible models is 2^(p-1) = 2⁴ = 16. It is difficult to assess all of these models, so to deal with this complexity we have three model search procedures: a) best subset, b) backward deletion, c) stepwise.
a) Best subset: three criteria are checked to select a potentially best model: 1) Cp, 2) AIC (Akaike's Information Criterion), 3) SBC (Schwarz Bayesian Criterion).
First we look at the adjusted R² (Ra²) of each model. At some stage its value levels off, and we discard any model whose Ra² decreased from the previous step. We then use the criteria above to find our model; when Cp = p (the number of parameters), the model has no bias.
Model 1
Number in Model   Ra²      R²       Cp       AIC         SBC
1                 0.2356   0.2557   8.7531   -141.5964   -138.26926
2                 0.2863   0.3238   6.7459   -143.3425   -138.35179
3                 0.3053   0.3602   6.6100   -143.4968   -136.84251
4                 0.3173   0.3891   6.9077   -143.3032   -134.09534
5                 0.3535   0.4386   6.0000   -144.5965   -134.61512
By these criteria, Model 3 has the minimum Cp and AIC among Models 1 to 4, and Model 2 has the minimum SBC. Although Model 5 has the overall minimum Cp and AIC, we did not consider it because it has a severe multicollinearity problem.
From this method, we selected the following models:
Model 1: log10 Y (No of followers) = β0 + β1·(Years since joining) + β2·(No of photos uploaded)
Model 2: log10 Y (No of followers) = β0 + β1·(No of tweets) + β2·(Years since joining) + β3·(No of photos uploaded)
b) Backward deletion
Model 2
From backward deletion, we selected the following model:
Model 1: log10 Y (No of followers) = 0.66796 + 0.11439·(Years since joining) + 0.00006297·(No of photos uploaded)
C) Stepwise
Model 1
From stepwise selection, we obtained the same model:
Model 1: log10 Y (No of followers) = 0.66796 + 0.11439·(Years since joining) + 0.00006297·(No of photos uploaded)
From the model search procedures above, we selected the following two models as our best models:
Model I: log10 Y (No of followers) = 0.66796 + 0.11439·(Years since joining) + 0.00006297·(No of photos uploaded)
Model II: log10 Y (No of followers) = β0 + β1·(No of tweets) + β2·(Years since joining) + β3·(No of photos uploaded)
Model 1
V. Model Selection
Model I
log10 Y (No of followers) = 0.66796 + 0.11439·(Years since joining) + 0.00006297·(No of photos uploaded)
Checking the model assumptions
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + εi
where Yi: No of followers; Xi1: Years since joining; Xi2: No of photos uploaded.
From the graphs above, there is no curvature in any plot, so the MLR model form is reasonable.
b) The residuals have constant variance
Residual vs logtenŶ
From the graph, the residuals have constant variance, as there is no funnel shape.
c) The residuals are normally distributed
Normal Probability Plot
From the graph, the residuals are approximately normal; normality is satisfied.
d) The residuals are uncorrelated: the data were not collected in time order.
e) Diagnostic
Outlying X observation
If hii is large, observation i is considered an X outlier with high leverage. A leverage value is considered large when hii > 2p/n.
2p/n = (2 × 3)/39 = 0.1538. Comparing the leverages against 0.1538, observations 3, 11, 15, and 32 exceed it, so they are X outliers.
Outlying Y observation
We flag as Y outliers the observations whose studentized deleted residuals are large in absolute value. The Rstudent column contains the studentized deleted residuals; to judge the largest absolute value |ti|, we perform the Bonferroni outlier test.
Bonferroni outlier test
α = 0.10, n = 39, p = 3
|ti| > t(1 - α/(2n), n - p - 1) = t(1 - 0.10/(2 × 39), 35) = t(0.99872, 35) = 3.2431
To find Y outliers, we look for observations whose |ti| exceeds 3.2431. All observations fall below 3.2431, so there are no Y outliers.
From the two results above, there are 4 X outliers in total.
Influence
After finding the X outliers, the next step is to check whether they are influential. To check their influence, we use three influence measures:
1) DFFITS 2) DFBETAS 3) Cook's distance
1) DFFITS: measures the influence of an observation on its fitted value Ŷ (No of followers).
Guideline: |DFFITS| > 2√(p/n) is considered influential; here 2√(3/39) = 0.5547.
Observation No   Type of Outlier   DFFITS
3                X                 0.1763
11               X                 0.2651
15               X                 0.2602
32               X                 0.4234
Conclusion: none of the observations exceeds 0.5547, so none has influence on the fitted value.
2) DFBETAS: measures the influence of the outliers on the regression coefficients of the intercept, Years since joining, and No of photos uploaded.
Guideline: |DFBETAS| > 2/√n is considered influential; here 2/√39 = 0.3202.
Observation No   Type   Intercept   X2       X3
3                X      0.1554      0.1554   0.0297
11               X      0.0162      0.0388   0.2518
15               X      0.0147      0.0345   0.2410
32               X      0.3771      0.3836   0.1965
Remarks: observations 3, 11, and 15 are not influential, while observation 32 is slightly influential on the intercept and on the coefficient of Years since joining.
3) Cook's distance: measures the influence of the outliers on all n fitted values; it is an aggregate influence measure of the ith case on all n fitted values.
Guideline: if Di > F(0.50, p, n - p), observation i is said to be influential; here F(0.50, 3, 36) = 0.80381.
Observation No   Type   Cook's distance
3                X      0.07852
11               X      0.02396
15               X      0.02301
32               X      0.06062
Remarks: none of the observations is influential on the fitted values.
From above three influence measures, we can say that none of the outliers is influential.
Variance Inflation Factor
Guideline: if the mean VIF is much greater than 1, or any (VIF)k is much greater than 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 1.04175 and (VIF)2 = 1.04175.
Neither predictor has a VIF above 5, and the mean VIF of 1.04175 is not much bigger than 1, so we conclude that serious multicollinearity is not a problem.
Model II
log10 Y (No of followers) = 0.67727 - 0.00000568·(No of tweets) + 0.11367·(Years since joining) + 0.000118437·(No of photos uploaded)
Verifying the model assumptions
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi
where Yi: No of followers; Xi1: No of tweets; Xi2: Years since joining; Xi3: No of photos uploaded.
From the plots of residuals against each predictor (No of tweets, Years since joining, No of photos uploaded), there is no curvature in any graph, so we can say the MLR model form is reasonable.
b) The residuals have constant variance
Residual vs logten Ŷ
There is no funnel shape, so variance is constant.
c) The residuals are normally distributed
Normal Probability plot
It is not perfectly straight but normality is okay.
d) The residuals are uncorrelated: We have not collected data in time order.
e) Diagnostics
Outlying X observation
If hii is large, observation i is considered an X outlier with high leverage. A leverage value is considered large when hii > 2p/n.
2p/n = (2 × 4)/39 = 0.2051. Comparing the leverages against 0.2051, observations 3, 11, 15, 31, and 32 exceed it, so they are X outliers.
Outlying Y observations
We consider observations as Y outliers when their studentized deleted residuals are large in absolute value. Here the Rstudent column gives the studentized deleted residuals. To judge the largest absolute value |ti|, we perform the Bonferroni outlier test.
Bonferroni outlier test
α = 0.10, n = 39, p = 4
|ti| > t(1 − α/(2n); n − p − 1) = t(1 − 0.10/(2 × 39); 34) = t(0.99872; 34) = 3.2504
To find Y outliers, we look for observations exceeding 3.2504. Comparing, no observation exceeds this value, so there are no Y outliers.
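The Bonferroni cutoff can be computed directly; a sketch using scipy, with the report's α = 0.10, n = 39, p = 4:

```python
from scipy import stats

def bonferroni_outlier_cutoff(alpha, n, p):
    """Cutoff for |t_i|, the studentized deleted residuals:
    t(1 - alpha/(2n); n - p - 1)."""
    return stats.t.ppf(1.0 - alpha / (2.0 * n), n - p - 1)

cutoff = bonferroni_outlier_cutoff(0.10, 39, 4)
print(round(cutoff, 2))  # ≈ 3.25, as quoted above
```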
So from the above two results, there are 5 X outliers in total and no Y outliers.
Influence
After identifying the outliers with respect to X, the next step is to check whether these X outliers are influential. To do so, we compute three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.
1) DFFITS: measures the influence of each case on its own fitted value Ŷ (no. of followers).
Guideline: |DFFITS| > 2·sqrt(p/n) is considered influential.
2·sqrt(4/39) = 0.6405
Observation No | Type of Outlier | DFFITS value from table
3              | X               | 0.1054
11             | X               | 0.7406
15             | X               | 0.0734
31             | X               | 0.7877
32             | X               | 0.2364
Remark: observations 11 and 31 exceed 0.6405, so these two observations influence the fitted values.
2) DFBETAS: measures the influence of each case on the regression coefficients (intercept, no. of tweets, years since they joined, no. of photos uploaded).
Guideline: |DFBETAS| > 2/sqrt(n) is considered influential.
2/sqrt(39) = 0.3202
Observation No | Type | Intercept | X1     | X2     | X3
3              | X    | 0.0913    | 0.0612 | 0.0916 | 0.0236
11             | X    | 0.0494    | 0.3676 | 0.0994 | 0.6727
15             | X    | 0.0023    | 0.0404 | 0.0075 | 0.0051
31             | X    | 0.0104    | 0.5898 | 0.0523 | 0.1557
32             | X    | 0.1948    | 0.0965 | 0.1969 | 0.1382
Remark: observation 11 influences the no. of tweets (X1) and no. of photos uploaded (X3) coefficients, while observation 31 influences the no. of tweets (X1) coefficient.
3) Cook's distance: measures the aggregate influence of the ith case on all n fitted values.
Guideline: if Di > F(0.50; p, n − p), observation i is influential.
F(0.50; 4, 35) = 0.8556
Observation No | Type of Outlier | Cook's value from table
3              | X               | 0.00286
11             | X               | 0.13717
15             | X               | 0.00139
31             | X               | 0.15318
32             | X               | 0.01433
Remark: none of the observations exceeds the F value, so none is influential on the fitted values.
From the above three influence measures, we conclude that the X outliers are not strongly influential.
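The three influence cutoffs used above can be collected into one helper; this is a sketch, with scipy supplying the F quantile for the Cook's distance rule and n = 39, p = 4 as in the report.

```python
import math
from scipy import stats

def influence_cutoffs(n, p, level=0.50):
    """Cutoffs for the three influence measures, per the guidelines above."""
    return {
        "dffits": 2.0 * math.sqrt(p / n),       # |DFFITS| cutoff
        "dfbetas": 2.0 / math.sqrt(n),          # |DFBETAS| cutoff
        "cooks": stats.f.ppf(level, p, n - p),  # Cook's distance cutoff
    }

cuts = influence_cutoffs(39, 4)
# cuts["dffits"] ≈ 0.6405 and cuts["dfbetas"] ≈ 0.3203, as in the report.
```

Each observed measure from the regression output is then compared against the matching cutoff to flag influential cases.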
f) The predictors are not highly correlated with each other.
Variance Inflation Factor (VIF)
Guideline: if the mean VIF is much greater than 1, or any (VIF)k ≥ 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 2.50902, (VIF)2 = 1.04198, (VIF)3 = 2.55792. No predictor has VIF > 5, and the mean VIF = 2.0363 is not dramatically larger than 1, so serious multicollinearity is not a problem.
Selection of the best model
Firstly, both models are significant. The R² value of Model I is 0.3238 and of Model II is 0.3602; there is not much increase in R² when no. of tweets (X1) is added in Model II. The residual vs logtenŶ plots for both models show roughly constant variance, but Model I's variance looks more constant. The normal probability plot of Model I is straighter than Model II's. The X outliers are more influential in Model II than in Model I. From the ANOVA of Model I, all predictors have p < α, so all are significant, while in Model II no. of tweets (X1) has p > α and the smallest t* among the predictors. So no. of tweets is not worth adding; it provides no additional information.
From this analysis, we selected Model I as the best model.
FINAL MULTIPLE LINEAR REGRESSION MODEL
After verifying the model assumptions and performing diagnostics, we take the model below as our final model. We can predict the no. of followers from the years since they joined and the no. of photos uploaded: growth in followers depends mainly on how long a person has been on Twitter, and secondly on the no. of photos he/she has uploaded.
Fit of the model
logtenŶ = 0.66796 + 0.11439 (years since they joined) + 0.00006297 (no. of photos uploaded)
Each additional year since joining is associated with a 0.11439 increase in logten(no. of followers), holding no. of photos uploaded constant; on the original scale this is a multiplicative factor of 10^0.11439 ≈ 1.30, i.e. roughly a 30% increase in followers per year. Each additional photo uploaded is associated with a 0.00006297 increase in logten(no. of followers), holding years since they joined constant.
As the VIF values for years since they joined and no. of photos uploaded are less than 5, and the mean VIF of 1.04175 is only slightly bigger than 1, we conclude there is no severe multicollinearity problem.
From the table, the p-value for years since they joined is less than 0.05, so this predictor is significant. The p-value for no. of photos uploaded is slightly greater than 0.05, so it is marginally significant; if we relax α to 0.10, it becomes significant.
F test: To check whether regression is significant or not
H0: β1 = β2 = 0
H1: not both β1 and β2 are 0
Decision rule: if F* > F(1 − α; p − 1, n − p), reject H0. α = 0.05
F* = 8.62, F(0.95; 2, 36) = 3.2594
Since 8.62 > 3.2594, we reject H0 and conclude that not both β1 and β2 are zero; in other words, the regression is significant.
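The decision rule can be checked numerically; a sketch in which F* = 8.62 is taken from the report's ANOVA output, not recomputed:

```python
from scipy import stats

f_star = 8.62                       # from the report's ANOVA table
f_crit = stats.f.ppf(0.95, 2, 36)   # F(0.95; p - 1, n - p), p = 3, n = 39
reject_h0 = f_star > f_crit
print(round(f_crit, 4), reject_h0)  # ≈ 3.2594, True
```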
Coefficient of multiple determination: R² = 0.3238. This is the fraction of the variability in no. of followers explained by the model with years since they joined and no. of photos uploaded.
Joint C.I. for the parameters
The Bonferroni joint confidence intervals estimate the regression coefficients simultaneously. With g parameters estimated jointly (g ≤ p), the confidence limits are
bk ± B·s{bk}, where B = t(1 − α/(2g); n − p).
Now,
b0 = 0.66796, b1 = 0.11439, b2 = 0.00006297
s{b0} = sqrt(0.0581118) = 0.241063
s{b1} = sqrt(0.0012516) = 0.035378
s{b2} = sqrt(1.0925E-9) = 3.30454E-05
B = 2.47887
C.I. for β0: (0.07040, 1.26552), β1: (0.026712, 0.202068), β2: (−1.89E-05, 1.45E-04).
We are 95% confident that β0 is in (0.07040, 1.26552), β1 is in (0.026712, 0.202068), and β2 is in (−1.89E-05, 1.45E-04) simultaneously.
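The joint intervals can be sketched as below. The coefficients and standard errors are taken from the report's table; the recomputed multiplier B may differ slightly from the quoted 2.47887 depending on software and rounding.

```python
from scipy import stats

def bonferroni_joint_ci(b, s, alpha, n, p):
    """Bonferroni joint limits b_k ± B·s{b_k}, B = t(1 - alpha/(2g); n - p)."""
    g = len(b)
    B = stats.t.ppf(1.0 - alpha / (2.0 * g), n - p)
    return [(bk - B * sk, bk + B * sk) for bk, sk in zip(b, s)]

b = [0.66796, 0.11439, 0.00006297]     # b0, b1, b2 from the final model
s = [0.241063, 0.035378, 3.30454e-05]  # s{b0}, s{b1}, s{b2}
cis = bonferroni_joint_ci(b, s, alpha=0.05, n=39, p=3)
```

As in the report, β1's interval lies entirely above zero while β2's interval straddles zero.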
C.I, C.B and P.I at xh of interest
xhᵀ = (1, 6.7, 910)
The hnew,new value is smaller than the largest hii, so there is no extrapolation and we can proceed with these xh values.
Confidence interval: (1.4398893, 1.5434771)
We are 95% confident that the mean logten(no. of followers) when years since they joined is 6.7 and no. of photos uploaded is 910 lies between 1.4398893 and 1.5434771; since the response is on the log10 scale, this corresponds to a mean of roughly 27.5 to 35.0 million followers.
Prediction interval: (1.176267, 1.8070995)
We are 95% confident that, for a new person with 6.7 years since joining and 910 photos uploaded, logten(no. of followers) will lie between 1.176267 and 1.8070995, i.e. roughly 15.0 to 64.1 million followers.
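Since the response was modeled on the log10 scale (in millions), the interval endpoints translate to follower counts by raising 10 to each endpoint; a sketch using the report's interval values:

```python
# Interval endpoints on the log10(millions) scale, from the report.
ci_log = (1.4398893, 1.5434771)   # C.I. for the mean response
pi_log = (1.1762670, 1.8070995)   # P.I. for a new observation

ci_millions = tuple(10.0 ** v for v in ci_log)
pi_millions = tuple(10.0 ** v for v in pi_log)
print(ci_millions)  # ≈ (27.5, 35.0) million followers
print(pi_millions)  # ≈ (15.0, 64.1) million followers
```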
Confidence band xh = (1.4167958, 1.5665706)
We are 95% confident that this region contains the entire regression surface over all combinations of values of the x variables.
Final discussion
Here we performed multiple linear regression analysis to see which predictors are helpful in predicting the no. of followers. We started with a preliminary model with four predictors (no. of tweets, years since they joined, no. of photos uploaded, and following back) and checked the model assumptions. The preliminary model showed non-constant variance, so we applied a log transformation; we also checked multicollinearity between the predictors. After the transformed model satisfied the assumptions, we checked whether any interaction terms should be added. Among the interactions we tried, years since they joined × following back (X2X4) was the most helpful for predicting the number of followers, and we added this standardized interaction term to the model. We then used model-search procedures to find the best model. Our best model predicts the number of followers from years since they joined and number of photos and videos posted. This model has constant variance, acceptable normality, and no serious multicollinearity problem.