Multiple Linear
Regression Project
Guided By: Dr. Chen
The University of Texas at Arlington
Japan Shah & Vishrut Mehta 5/5/16
Applied Linear Regression - IE 5318
Project Proposal
Problem statement:
In this multiple linear regression project, we try to determine the relationship between one response
variable (Y) and four predictor variables (X1, X2, X3, X4). We want to determine how the values of the
predictor variables affect the value of the response variable, and which predictor shows the strongest
linear relationship with the response.
The variables:
The variables are as follows:
 Response variable (Y): Number of followers a person has on Twitter (in millions)
 Predictor variable (X1): Number of tweets posted by that person
 Predictor variable (X2): Number of years since that person joined Twitter
 Predictor variable (X3): Number of photos and videos posted
 Predictor variable (X4): Number of people that person is following back
The data collection process:
We used the official Forbes site to find the 100 most-followed people on Twitter
(http://www.forbes.com/sites/maddieberg/2015/06/29/twitters-most-followed-celebrities-retweets-dont-
always-mean-dollars/#35671f137ef3).
We also used Twitter itself (www.twitter.com) to collect the data for the response variable and each
predictor variable. We searched for these people's accounts on Twitter and confirmed each as the
person's official account by looking for the verified-account badge. We used the 40 most-followed
people among them.
Why would modeling this data set be meaningful?
Social media strongly affects people's lives today and is constantly evolving with new sites; it has
revolutionized how we look at things. While some people have made their way to success through hard
work and years of experience, others have made it through social media, which involves just as much hard
work. According to ebizmba.com (http://www.ebizmba.com/articles/social-networking-websites), Twitter
is the second most popular social media site today. It is worth understanding which factors drive a
person's popularity, so we will determine which of these four predictor variables affects a person's
popularity the most.
Matrix Scatterplot and Pairwise Correlation Coefficients
The scatterplot matrix helps us understand the relationship between the response variable, No. of
followers (Y), and the predictor variables: No. of tweets (X1), Years since they joined (X2), No. of photos
uploaded (X3), and Following back (X4). It also shows the scatterplots between pairs of predictors:
X1 vs X2, X1 vs X3, X1 vs X4, X2 vs X3, X2 vs X4, and X3 vs X4.
The correlation coefficient matrix gives the correlations between the response (Y) and each predictor (X)
and between pairs of predictors. Each coefficient lies between -1 and +1. In absolute value, a correlation
greater than 0.7 is considered high, around 0.5 medium, and less than 0.3 low. It is good to have a high
correlation (|r| ≥ 0.7) between the response and a predictor.
A large correlation coefficient between two predictors indicates multicollinearity. The correlation
coefficient always satisfies -1 ≤ r ≤ +1. If the correlation between two predictors is greater than zero,
high values of one predictor occur with high values of the other and low values with low values; if it is
less than zero, high values of one predictor occur with low values of the other, and vice versa. A high
correlation between two predictors is undesirable because it indicates severe multicollinearity, which
inflates the variance of the coefficient estimates and makes the estimates sensitive to changes in the
model. We therefore want the correlations between predictors to be near zero.
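The report's output appears to come from a statistics package such as SAS; the sketch below shows, under assumed column names (followers, tweets, years_joined, photos, following) and a hypothetical file name, how the same correlation matrix and scatterplot matrix could be reproduced in Python.

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical file and column names standing in for the report's data set.
df = pd.read_csv("twitter_followers.csv")
cols = ["followers", "tweets", "years_joined", "photos", "following"]

# Pairwise Pearson correlations: Y vs each X and each X vs X pair.
print(df[cols].corr(method="pearson").round(5))

# Scatterplot matrix analogous to the report's matrix scatterplot.
scatter_matrix(df[cols], figsize=(10, 10))
```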
Below we discuss the scatterplot and correlation matrices.
Y vs X plots
 No. of followers vs No. of tweets (Y vs X1): positive upward trend; r = 0.14782, a low correlation.
 No. of followers vs Years since they joined (Y vs X2): positive upward trend; r = 0.51656, a moderate correlation.
 No. of followers vs No. of photos uploaded (Y vs X3): positive trend; r = 0.32294, a low correlation.
 No. of followers vs Following back (Y vs X4): positive upward trend; r = 0.47663, a moderate correlation.
X vs X plots
 No. of tweets vs Years since they joined (X1 vs X2): positive upward trend; r = 0.14618, near zero. These predictors are only weakly correlated, which is good.
 No. of tweets vs No. of photos uploaded (X1 vs X3): strong positive upward trend; r = 0.77547. These predictors are highly correlated, which is bad.
 No. of tweets vs Following back (X1 vs X4): positive upward trend; r = 0.18112, near zero. Weakly correlated, which is good.
 Years since they joined vs No. of photos uploaded (X2 vs X3): positive upward trend; r = 0.20020, near zero. Weakly correlated, which is good.
 Years since they joined vs Following back (X2 vs X4): positive upward trend; r = 0.57997. Moderately correlated, which is bad.
 No. of photos uploaded vs Following back (X3 vs X4): positive upward trend; r = 0.23780, near zero. Weakly correlated, which is good.
Overall, we can say that there is a multicollinearity problem.
Potential complication
There is severe multicollinearity between X1 and X3 (r = 0.77547).
II. PRELIMINARY MULTIPLE LINEAR REGRESSION MODEL
The general linear regression model is
Yi = β0 + β1Xi1 + β2Xi2 + … + βp-1Xi,p-1 + εi,
where β0, β1, β2, …, βp-1 are parameters, Xi1, …, Xi,p-1 are known constants, the εi are independent
N(0, σ²), and i = 1, …, n.
The fitted model is
Ŷ (No. of followers) = -25.40410 - 0.00044178(No. of tweets) + 8.15496(Years since they joined)
+ 0.00877(No. of photos uploaded) + 0.00003188(Following back)
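A minimal sketch of fitting this preliminary model in Python with statsmodels, reusing the hypothetical data frame and column names from the earlier sketch; the formula interface estimates β0 through β4 by ordinary least squares.

```python
import statsmodels.formula.api as smf

# Preliminary MLR fit: Y ~ X1 + X2 + X3 + X4 (assumed column names).
prelim = smf.ols("followers ~ tweets + years_joined + photos + following",
                 data=df).fit()
print(prelim.params)     # b0..b4, cf. the fitted equation above
print(prelim.summary())  # t tests, F test, R^2, standard errors
```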
Model assumptions
For model adequacy, we need to satisfy the following model assumptions:
a) The current MLR model form is reasonable.
b) The residuals have constant variance.
c) The residuals are normally distributed (not required, but desired).
d) The residuals are uncorrelated.
e) There are no outliers.
f) The predictors are not highly correlated with each other.
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi,
where Yi is the No. of followers; β0, β1, β2, β3, β4 are the parameters; and Xi1 is the No. of tweets,
Xi2 the Years since they joined, Xi3 the No. of photos uploaded, and Xi4 the Following back.
From the plots of the residuals versus each predictor (No. of tweets, Years since they joined, No. of
photos uploaded, Following back), there is no curvature in any plot. This shows that the MLR model
form is reasonable.
b) The residuals have constant variance.
Residual vs Ŷ: this plot helps us judge whether the residuals have constant variance. The plot shows a
funnel shape, so the residuals have non-constant variance.
Modified Levene test: a test for constancy of the error variance.
F test
H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.0006 < 0.05, so we reject H0; this is a strong conclusion.
We conclude that the error variance is not constant.
Two-sample t test
Since the group variances are unequal, we read the unequal-variance row.
H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.0482 < 0.05, so we reject H0. We conclude that the error variance is not constant.
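The modified Levene (Brown-Forsythe) test splits the residuals into low and high fitted-value groups and compares the absolute deviations from each group's median. A sketch, assuming a median split of the fitted values (the report does not state its exact grouping):

```python
import numpy as np
from scipy import stats

resid = prelim.resid.values
fitted = prelim.fittedvalues.values

# Split residuals into two groups by the median fitted value (assumed split).
low = resid[fitted <= np.median(fitted)]
high = resid[fitted > np.median(fitted)]

# center="median" gives the Brown-Forsythe variant of Levene's test.
stat, p = stats.levene(low, high, center="median")
print(f"F* = {stat:.4f}, p = {p:.4f}")  # p < 0.05 -> reject constant variance
```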
c) The residuals are normally distributed
Normal probability plot: this plot of the residuals versus their expected normal scores helps us judge
whether the residuals are normally distributed. The plot shows that the residuals are right-skewed, so
the residuals are not normal.
Test for normality
H0: normality is okay; H1: normality is violated.
Decision rule: reject H0 if ρ̂ < c(α, n), with α = 0.10.
From the critical values table, c(α, n) = 0.977, and from the output ρ̂ = 0.92338 < 0.977. So we reject
H0; this is a strong conclusion. We conclude that normality is violated.
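The test statistic ρ̂ is the correlation between the ordered residuals and their expected values under normality, compared with the tabled critical value c(α, n). A sketch; the Blom plotting positions are an assumption, and the correlation is unaffected by the scale of the expected values:

```python
import numpy as np
from scipy import stats

e = np.sort(prelim.resid.values)
n = len(e)

# Expected normal scores for the ordered residuals (Blom plotting positions).
z = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))

rho_hat = np.corrcoef(e, z)[0, 1]
print(rho_hat)  # reject H0 (normality OK) if rho_hat < c(alpha, n)
```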
d) The residuals are uncorrelated
The data were not collected in time order, so a time plot is not relevant.
e) Outliers: there are no outliers.
f) The predictors are not highly correlated with each other.
Variance inflation factor: the VIF is a method of detecting multicollinearity between the predictors. It
measures how much the variances of the estimated regression coefficients are inflated compared with
when the predictors are not linearly related. It is found by regressing Xk on the other p - 2 predictors.
The formula is
(VIF)k = 1 / (1 - Rk²),
where Rk² is the coefficient of multiple determination of that regression.
Guideline: if the mean VIF is much greater than 1, or any (VIF)k is much greater than 10, there is serious
multicollinearity; we should avoid models with any (VIF)k > 5. Here (VIF)1 = 2.50905, (VIF)2 = 1.51656,
(VIF)3 = 2.58140, (VIF)4 = 1.54274.
The variances of the coefficients of No. of tweets, Years since they joined, No. of photos uploaded and
Following back are inflated 2.51, 1.52, 2.58 and 1.54 times, respectively, compared with when the
predictors are not linearly related. None of the predictors has a VIF > 5, and the mean VIF of 2.037 is not
much bigger than 1, so we conclude that serious multicollinearity is not a problem.
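A sketch of the VIF computation with statsmodels; each VIF regresses one predictor on the others, exactly as described above.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["tweets", "years_joined", "photos", "following"]])

# Skip column 0 (the intercept); one VIF per predictor.
vifs = [variance_inflation_factor(X.values, k) for k in range(1, X.shape[1])]
print(np.round(vifs, 5))  # cf. 2.50905, 1.51656, 2.58140, 1.54274
print(np.mean(vifs))      # mean VIF, cf. 2.037
```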
Transformation
The residuals showed non-constant variance and non-normality, so to satisfy the model assumptions we
need to perform a transformation. Our main goal is to obtain constant variance, since no transformation
is guaranteed to fix normality. We tried the variance-stabilizing transformations from weakest to
strongest: the weakest is the square-root transformation Y' = sqrt(Y), i.e. λ = 0.5, and the strongest is
Y' = -1/Y, i.e. λ = -1. The square-root transformation could not satisfy our model assumptions, so we
moved to the log transformation (λ = 0).
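A sketch of the transformation step: create log10(Y) and refit the model; the sqrt and reciprocal rungs of the ladder would be tried the same way.

```python
import numpy as np
import statsmodels.formula.api as smf

# Log transformation (lambda = 0 on the variance-stabilizing ladder).
df["log10_followers"] = np.log10(df["followers"])
logfit = smf.ols("log10_followers ~ tweets + years_joined + photos + following",
                 data=df).fit()
# Re-examine the residual-vs-fitted and normal probability plots on this fit.
```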
a) The current MLR model form is reasonable
log10(Yi) = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi,
where Yi is the No. of followers; β0, β1, β2, β3, β4 are the parameters; and Xi1 is the No. of tweets,
Xi2 the Years since they joined, Xi3 the No. of photos uploaded, and Xi4 the Following back.
From the plots of the residuals versus each predictor (No. of tweets, Years since they joined, No. of
photos uploaded, Following back), there is no curvature in any of the plots. So the MLR model form is
reasonable.
b) The residuals have constant variance
Residual vs log10 Ŷ: the plot shows no funnel shape, so the variance is constant.
Modified Levene test
F test: H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.1910 > 0.05, so we fail to reject H0. We conclude that the variance is constant.
Since the variance is constant, we move to the equal-variance row of the t test.
Two-sample t test: H0: the variance is constant; H1: the variance is not constant.
Decision rule: reject H0 if p < α, with α = 0.05.
From the table, p = 0.0936 > 0.05, so we fail to reject H0. We conclude that the variance is constant.
c) Normality plot and normality test
Normality test
α = 0.05. H0: normality is okay; H1: normality is violated.
Decision rule: reject H0 if ρ̂ < c(α, n).
From the critical values table, c(α, n) = 0.972, and from the output ρ̂ = 0.974 > 0.972. So we fail to
reject H0; this is a weak conclusion. We conclude that normality is okay. At α = 0.10 the model would
fail the normality test, but in multiple linear regression the normality assumption is not required, only
desired, so we can move ahead with this result. The important assumption that needs to be satisfied is
constant variance, which we have gained with the log transformation, so we stopped the transformations
at the log.
d) The data were not collected in time order, so a time plot is not relevant.
e) Diagnostics
Outlying X observations
The hat matrix is helpful in identifying X outliers. Its diagonal elements hii lie between 0 and 1 and sum
to p, the number of parameters. hii measures the distance between the X values of the ith case and the
mean of the X values of all n cases; in this context it is called the leverage of case i.
If hii is large, observation i is considered an X outlier with high leverage. A leverage value hii is
considered large when hii > 2p/n. Here 2p/n = (2 × 5)/39 = 0.2564. Comparing, observations 3, 9, 11, 15,
31 and 32 have leverage greater than 0.2564, so they are X outliers.
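A sketch of this leverage screen on the transformed model, flagging observations with hii > 2p/n:

```python
import numpy as np

infl = logfit.get_influence()
h = infl.hat_matrix_diag                # diagonal of the hat matrix
p = int(logfit.df_model) + 1            # number of parameters, 5 here
n = int(logfit.nobs)

cutoff = 2 * p / n                      # 2*5/39 = 0.2564
print(np.flatnonzero(h > cutoff) + 1)   # 1-based indices, cf. 3, 9, 11, 15, 31, 32
```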
Outlying Y observations
We consider a Y observation a Y outlier when its studentized deleted residual is large in absolute value.
The RStudent column gives the studentized deleted residuals. To judge the largest absolute value |ti|, we
perform the Bonferroni outlier test.
Bonferroni outlier test
α = 0.10, n = 39, p = 5.
Bonferroni critical value: |ti| > t(1 - α/(2n); n - p - 1) = t(1 - 0.10/(2 × 39); 33) = t(0.9987; 33) = 3.25817.
To find Y outliers, we look for observations exceeding 3.25817; by comparison, we found no Y outliers.
So from the two results above, observations 3, 9, 11, 15, 31 and 32 are X outliers, and there are no Y
outliers.
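A sketch of the Bonferroni outlier test on the studentized deleted residuals (SAS's RStudent column), reusing the influence object from the leverage sketch:

```python
from scipy import stats

t_del = infl.resid_studentized_external  # studentized deleted residuals
alpha, n, p = 0.10, 39, 5

crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)
print(crit)                       # cf. 3.25817
print((abs(t_del) > crit).sum())  # 0 -> no Y outliers
```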
Influence
After finding the X outliers, the next step is to check whether they are influential. To check their
influence, we use three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.
1) DFFITS: measures the influence of an outlier on its fitted value Ŷ (No. of followers).
Guideline: |DFFITS| > 2·sqrt(p/n) is considered influential; here 2·sqrt(5/39) = 0.71611.
Observation No | Type of outlier | DFFITS
3 | X | -1.3807
9 | X | -0.3078
11 | X | -0.6020
15 | X | -0.0438
31 | X | -0.6945
32 | X | 0.0242
Remark: only observation 3 exceeds 0.71611 in absolute value, so it has influence on its fitted value of
No. of followers.
2) DFBETAS: measures the influence of the X outliers on the regression coefficients of the intercept,
No. of tweets, Years since they joined, No. of photos uploaded and Following back. We look at the
absolute value of DFBETAS; a large absolute value is considered influential.
Guideline: |DFBETAS| > 2/sqrt(n) is considered influential; here 2/sqrt(39) = 0.3202.
Observation No | Type | Intercept | X1 | X2 | X3 | X4
3 | X | 0.0599 | 0.1398 | -0.0490 | -0.0977 | -1.0565
9 | X | -0.0173 | 0.0315 | 0.0061 | 0.0439 | -0.2427
11 | X | 0.0519 | 0.2885 | -0.0207 | -0.5411 | 0.1529
15 | X | 0.0005 | -0.0241 | 0.0021 | -0.0033 | 0.0029
31 | X | 0.0824 | -0.5061 | -0.0539 | 0.1173 | 0.1629
32 | X | 0.0802 | -0.0093 | -0.0203 | 0.0124 | 0.0083
Remark: outliers 3 and 31 influence the Following back and No. of tweets coefficients, respectively;
observation 11 also exceeds the cutoff for the No. of photos uploaded coefficient.
3) Cook's distance: measures the influence of the X outliers on all n fitted values. It is an aggregate
influence measure of the ith case on all n fitted values, denoted Di.
Guideline: if Di > F(0.50; p, n - p), observation i is considered influential. Here F(0.50; 5, 34) = 0.8878.
Observation No | Type of outlier | Cook's distance
3 | X | 0.21800
9 | X | 0.06657
11 | X | 0.08801
15 | X | 0.00886
31 | X | 0.08420
32 | X | 0.00127
Remark: none of the observations is above the F value, so none is influential on all n fitted values of
No. of followers.
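A sketch of computing all three influence measures from the same OLSInfluence object, together with the cutoffs used above:

```python
import numpy as np
from scipy import stats

dffits, _ = infl.dffits           # per-observation DFFITS
dfbetas = infl.dfbetas            # n x p matrix of DFBETAS
cooks_d, _ = infl.cooks_distance  # per-observation Cook's distance

print(2 * np.sqrt(5 / 39))        # DFFITS cutoff, 0.71611
print(2 / np.sqrt(39))            # DFBETAS cutoff, 0.3202
print(stats.f.ppf(0.50, 5, 34))   # Cook's distance cutoff, cf. 0.8878
```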
f) Variance inflation factor
Guideline: if the mean VIF is much greater than 1, or any (VIF)k is much greater than 10, there is serious
multicollinearity; we should avoid models with any (VIF)k > 5.
(VIF)1 = 2.50905, (VIF)2 = 1.51656, (VIF)3 = 2.58140, (VIF)4 = 1.54274.
None of the predictors has a VIF > 5, and the mean VIF of 2.037 is not much bigger than 1, so we
conclude that serious multicollinearity is not a problem.
The preliminary model is
log10(No. of followers) = 0.87514 - 0.00000570(No. of tweets) + 0.08402(Years since they joined)
+ 0.00011227(No. of photos uploaded) + 3.081619E-7(Following back)
The parameter estimates are b0 = 0.87514, b1 = -0.00000570, b2 = 0.08402, b3 = 0.00011227,
b4 = 3.081619E-7.
log10(No. of followers) decreases by 0.00000570 when the No. of tweets increases by one, with the
other predictors held constant. It increases by 0.08402 when Years since they joined increases by one
year, by 0.00011227 when the No. of photos uploaded increases by one, and by 3.08E-7 when Following
back increases by one, in each case with the other predictors held constant.
There are 39 observations, so the degrees of freedom for the corrected total are n - 1 = 38. We have 4
predictors, so the degrees of freedom for the model are 4, and the degrees of freedom for the error are
39 - 4 - 1 = 34.
Standard error: the standard errors of the regression coefficients; they help us construct confidence
intervals.
Sum of squares: the decomposition is SST (total sum of squares) = SSM (model sum of squares) + SSE
(error sum of squares). SST shows the total variation in the response. SSE shows the unexplained
variation in the No. of followers, based on the deviations yi - ŷi; SSM shows the explained variability,
based on the deviations ŷi - ȳ, i.e. the variation due to the model.
Mean square: the ratio of a sum of squares to its corresponding degrees of freedom. The mean square
error is an estimate of the variance σ² for our model; MSE = 0.02251.
Root MSE: 0.15004. This is s, the estimate of the parameter σ of our model.
Dependent mean: 1.51431, the mean of log10(Y).
F value: the test statistic for checking whether the regression is significant.
H0: β1 = β2 = β3 = β4 = 0
H1: not all of β1, β2, β3, β4 are 0
Decision rule: reject H0 if F* > F(1 - α; p - 1, n - p), with α = 0.05.
F* = 5.41 and F(0.95; 4, 34) ≈ 2.65. Since 5.41 > 2.65, we reject H0 and conclude that not all of β1, β2,
β3, β4 are zero. In other words, the regression is significant.
Coefficient of multiple determination (R²): from the table, R² = 0.3891, the fraction of the variability in
No. of followers explained by No. of tweets, Years since they joined, No. of photos uploaded and
Following back.
Significance of predictors
t value: the t statistic, the ratio of a parameter estimate to its standard error. The null hypothesis is
that the regression coefficient is zero; if a predictor's coefficient is zero, the predictor does not
contribute significantly to the model. A predictor is a candidate to drop when its |t*| is the smallest
among all predictors.
The Pr > |t| values are two-sided. Examining each t statistic and its corresponding p value tells us the
significance. The p values for No. of tweets (X1) and Following back (X4) are greater than α = 0.05, so
these predictors are not significant for predicting No. of followers. The p value for Years since they
joined (X2) is almost equal to 0.05, so this predictor is significant, and the p value for No. of photos
uploaded (X3) is less than α = 0.05, so both of these predictors are significant in predicting No. of
followers.
III. Exploration of Interaction Terms Using Partial Regression Plots
A partial regression plot is also called an added-variable plot. It helps us understand the marginal role of
an interaction term given that the other predictor variables are present in the model. From these plots,
we judge which interaction would be helpful for predicting No. of followers (Y). We look for a trend in
the plot, positive or negative; if there is no trend, the interaction should not be added.
The following interaction terms are possible:
1) X1X2: No. of tweets × Years since they joined    2) X1X3: No. of tweets × No. of photos uploaded
3) X1X4: No. of tweets × Following back    4) X2X3: Years since they joined × No. of photos uploaded
5) X2X4: Years since they joined × Following back    6) X3X4: No. of photos uploaded × Following back
Here we regress the residuals of log10(Y) given X1, X2, X3, X4 against the residuals of X1X2 given X1, X2,
X3, X4. The plot shows a slight negative trend, but the points form essentially a horizontal band, so this
interaction does not contain additional information useful for predicting No. of followers (Y).
Here we regress the residuals of log10(Y) given X1, X2, X3, X4 against the residuals of X1X3 given X1, X2,
X3, X4. The plot shows a slight negative trend, but the points form essentially a horizontal band, so this
interaction does not contain additional information useful for predicting No. of followers (Y).
Here we regress the residuals of log10(Y) given X1, X2, X3, X4 against the residuals of X1X4 given X1, X2,
X3, X4. The trend is positive, so this interaction may be helpful in predicting No. of followers.
Here we regress the residuals of log10(Y) given X1, X2, X3, X4 against the residuals of X2X3 given X1, X2,
X3, X4. The plot shows no trend, so this interaction will not provide any additional information for
predicting No. of followers.
Here we regress the residuals of log10(Y) given X1, X2, X3, X4 against the residuals of X2X4 given X1, X2,
X3, X4. The trend is negative, so this interaction may be helpful in predicting No. of followers.
Here we regress the residuals of log10(Y) given X1, X2, X3, X4 against the residuals of X3X4 given X1, X2,
X3, X4. The trend is positive, so this interaction may be useful in predicting No. of followers.
From the interaction plots above, the X1X2 and X1X3 plots show only a slight negative trend, so these
two interactions will not be helpful. The X2X3 plot shows no trend, so there is no need to add it to the
model. Among X1X4, X2X4 and X3X4, the X2X4 plot shows the clearest trend around the regression line,
so adding X2X4 as an interaction term is sufficient.
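A sketch of one added-variable plot, here for the X2X4 interaction: both log10(Y) and X2·X4 are regressed on X1 through X4, and the two residual series are plotted against each other (column names as assumed earlier).

```python
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df["x2x4"] = df["years_joined"] * df["following"]
base = "tweets + years_joined + photos + following"

e_y = smf.ols("log10_followers ~ " + base, data=df).fit().resid
e_i = smf.ols("x2x4 ~ " + base, data=df).fit().resid

plt.scatter(e_i, e_y)            # a trend here means the interaction
plt.xlabel("e(X2*X4 | X1..X4)")  # adds information beyond X1..X4
plt.ylabel("e(log10 Y | X1..X4)")
plt.show()
```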
Correlations involving the added interaction term before and after standardization
Standardization is important for models having interaction and polynomial terms; standardizing the
predictors helps reduce the multicollinearity. A standardized variable is obtained by centering the mean
to zero and scaling the variance to 1. Centering the predictors is especially important for the interaction
terms.
Before standardization
After standardization
From the results above, we can see the effect of standardization: it helps reduce the multicollinearity.
In the after-standardization correlation matrix, the correlations between the interaction Years since they
joined × Following back and No. of tweets, Years since they joined, No. of photos uploaded, and
Following back have all decreased.
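A sketch of this standardization step: center each predictor to mean zero and scale to unit variance before forming the interaction, then recheck the correlations.

```python
zcols = []
for col in ["tweets", "years_joined", "photos", "following"]:
    df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std()
    zcols.append(col + "_z")

# Interaction built from the standardized (centered) predictors.
df["x2x4_z"] = df["years_joined_z"] * df["following_z"]

print(df[zcols + ["x2x4_z"]].corr().round(3))  # cf. the "after" matrix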
IV. Model Search
With 4 predictor variables, the number of parameters is p = 5 and the total number of possible models
is 2^(p-1) = 2^4 = 16. It is difficult to assess all the models, so to deal with this complexity we have three
model search procedures: a) best subsets, b) backward deletion, c) stepwise.
a) Best subsets: three criteria are checked to select the potentially best model: 1) Cp, 2) AIC (Akaike's
information criterion), 3) SBC (Schwarz Bayesian criterion).
First we look at Ra² for the models. At some stage the Ra² values level off, and we discard a model whose
Ra² decreased from the previous step. Then we use the criteria above to find our model. When Cp = p
(the number of parameters), there is no bias.
Model 1
Number in model | Ra² | R² | Cp | AIC | SBC
1 | 0.2356 | 0.2557 | 8.7531 | -141.5964 | -138.26926
2 | 0.2863 | 0.3238 | 6.7459 | -143.3425 | -138.35179
3 | 0.3053 | 0.3602 | 6.6100 | -143.4968 | -136.84251
4 | 0.3173 | 0.3891 | 6.9077 | -143.3032 | -134.09534
5 | 0.3535 | 0.4386 | 6.0000 | -144.5965 | -134.61512
By these criteria, the minimum Cp and minimum AIC among the candidates belong to the 3-term model
and the minimum SBC to the 2-term model. Though the 5-term model has the overall minimum Cp and
AIC, we did not consider it because of its severe multicollinearity problem.
From this method, we selected the following models:
Model 1: log10(Y) (No. of followers) = β0 + β1(Years since they joined) + β2(No. of photos uploaded)
Model 2: log10(Y) (No. of followers) = β0 + β1(No. of tweets) + β2(Years since they joined) + β3(No. of
photos uploaded)
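A sketch of an exhaustive best-subsets search over the 15 non-empty models, scoring each by adjusted R², AIC and BIC; note that statsmodels' likelihood-based AIC/BIC differ from SAS's AIC/SBC by an additive constant, so rankings rather than raw values should be compared, and Mallows' Cp would additionally need the full model's MSE.

```python
from itertools import combinations
import statsmodels.formula.api as smf

predictors = ["tweets", "years_joined", "photos", "following"]
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = smf.ols("log10_followers ~ " + " + ".join(subset), data=df).fit()
        print(subset, round(fit.rsquared_adj, 4),
              round(fit.aic, 2), round(fit.bic, 2))
```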
b) Backward deletion
Model 2
From backward deletion, we selected the following model:
Model 1: log10(Y) (No. of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No. of
photos uploaded)
c) Stepwise
Model 1
From stepwise selection, we selected the following model:
Model 1: log10(Y) (No. of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No. of
photos uploaded)
From the model search procedures above, we selected the following two models as our best models:
Model I: log10(Y) (No. of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No. of
photos uploaded)
Model II: log10(Y) (No. of followers) = β0 + β1(No. of tweets) + β2(Years since they joined) + β3(No. of
photos uploaded)
Model 1
V. Model Selection
Model I
log10(Y) (No. of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No. of photos
uploaded)
Checking the model assumptions
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + εi,
where Yi is the No. of followers, Xi1 the Years since they joined, and Xi2 the No. of photos uploaded.
From the plots above, there is no curvature at all in any plot, so the MLR model form is reasonable.
b) The residuals have constant variance
Residual vs log10 Ŷ: the plot shows no funnel shape, so we conclude the residuals have constant variance.
c) The residuals are normally distributed
Normal probability plot: the plot is approximately straight, so normality is satisfied.
d) The residuals are uncorrelated: the data were not collected in time order.
e) Diagnostics
Outlying X observations
If hii is large, observation i is considered an X outlier with high leverage. A leverage value hii is
considered large when hii > 2p/n. Here 2p/n = (2 × 3)/39 = 0.1538. Comparing, observations 3, 11, 15
and 32 have leverage greater than 0.1538, so they are X outliers.
Outlying Y observations
We consider a Y observation a Y outlier when its studentized deleted residual is large in absolute value.
The RStudent column gives the studentized deleted residuals. To judge the largest absolute value |ti|, we
perform the Bonferroni outlier test.
Bonferroni outlier test
α = 0.10, n = 39, p = 3.
|ti| > t(1 - α/(2n); n - p - 1) = t(1 - 0.10/(2 × 39); 35) = t(0.9987; 35) = 3.2431.
To find Y outliers, we look for observations exceeding 3.2431; by comparison, all observations are below
3.2431, so there are no Y outliers.
So from the two results above, there are four X outliers in total.
Influence
After finding the X outliers, the next step is to check whether these outliers are influential, using three
influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.
1) DFFITS: measures the influence of an outlier on its fitted value Ŷ (No. of followers).
Guideline: |DFFITS| > 2·sqrt(p/n) is considered influential; here 2·sqrt(3/39) = 0.5547.
Observation No | Type of outlier | DFFITS
3 | X | 0.1763
11 | X | 0.2651
15 | X | 0.2602
32 | X | 0.4234
Conclusion: none of the observations exceeds 0.5547, so none has influence on its fitted value.
2) DFBETAS: measures the influence of the outliers on the regression coefficients of the intercept, Years
since they joined and No. of photos uploaded.
Guideline: |DFBETAS| > 2/sqrt(n) is considered influential; here 2/sqrt(39) = 0.3202.
Observation No | Type | Intercept | X2 | X3
3 | X | 0.1554 | 0.1554 | 0.0297
11 | X | 0.0162 | 0.0388 | 0.2518
15 | X | 0.0147 | 0.0345 | 0.2410
32 | X | 0.3771 | 0.3836 | 0.1965
Remark: observations 3, 11 and 15 are not influential, while observation 32 is slightly influential on the
intercept and the coefficient of Years since they joined.
3) Cook's distance: measures the influence of the outliers on all n fitted values; it is an aggregate
influence measure of the ith case on all n fitted values.
Guideline: if Di > F(0.50; p, n - p), observation i is considered influential. Here F(0.50; 3, 36) = 0.80381.
Observation No | Type of outlier | Cook's distance
3 | X | 0.07852
11 | X | 0.02396
15 | X | 0.02301
32 | X | 0.06062
Remark: none of the observations is influential on the fitted values.
From the three influence measures above, we can say that none of the outliers is influential.
Variance inflation factor
Guideline: if the mean VIF is much greater than 1, or any (VIF)k is much greater than 10, there is serious
multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 1.04175 and
(VIF)2 = 1.04175.
Neither predictor has a VIF > 5, and the mean VIF of 1.04175 is not much bigger than 1, so we conclude
that serious multicollinearity is not a problem.
Model II
log10(Y) (No. of followers) = 0.67727 - 0.00000568(No. of tweets) + 0.11367(Years since they joined)
+ 0.000118437(No. of photos uploaded)
Verifying the model assumptions
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi,
where Yi is the No. of followers, Xi1 the No. of tweets, Xi2 the Years since they joined, and Xi3 the No. of
photos uploaded.
From the plots of the residuals versus each predictor (No. of tweets, Years since they joined, No. of
photos uploaded), there is no curvature at all in any plot, so the MLR model form is reasonable.
b) The residuals have constant variance
Residual vs log10 Ŷ: there is no funnel shape, so the variance is constant.
c) The residuals are normally distributed
Normal probability plot: the plot is not perfectly straight, but normality is okay.
d) The residuals are uncorrelated: the data were not collected in time order.
e) Diagnostics
Outlying X observations
If hii is large, observation i is considered an X outlier with high leverage. A leverage value hii is
considered large when hii > 2p/n. Here 2p/n = (2 × 4)/39 = 0.2051. Comparing, observations 3, 11, 15,
31 and 32 have leverage greater than 0.2051, so they are X outliers.
Outlying Y observations
We consider a Y observation a Y outlier when its studentized deleted residual is large in absolute value.
The RStudent column gives the studentized deleted residuals. To judge the largest absolute value |ti|, we
perform the Bonferroni outlier test.
Bonferroni outlier test
α = 0.10, n = 39, p = 4.
|ti| > t(1 - α/(2n); n - p - 1) = t(1 - 0.10/(2 × 39); 34) = t(0.9987; 34) = 3.2504.
To find Y outliers, we look for observations exceeding 3.2504; by comparison, there are no Y outliers.
So from the two results above, there are five X outliers in total.
Influence
After finding the outliers with respect to the X values, the next step is to check whether these X outliers
are influential, using three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.
1) DFFITS: measures the influence of the X outliers on the fitted value Ŷ (No. of followers).
Guideline: |DFFITS| > 2·sqrt(p/n) is considered influential; here 2·sqrt(4/39) = 0.6405.
Observation No | Type of outlier | DFFITS
3 | X | 0.1054
11 | X | 0.7406
15 | X | 0.0734
31 | X | 0.7877
32 | X | 0.2364
Remark: observations 11 and 31 exceed 0.6405, so these two observations have influence on their fitted
values.
2) DFBETAS: measures the influence of the X outliers on the regression coefficients of the intercept,
No. of tweets, Years since they joined and No. of photos uploaded.
Guideline: |DFBETAS| > 2/sqrt(n) is considered influential; here 2/sqrt(39) = 0.3202.
Observation No | Type | Intercept | X1 | X2 | X3
3 | X | 0.0913 | 0.0612 | 0.0916 | 0.0236
11 | X | 0.0494 | 0.3676 | 0.0994 | 0.6727
15 | X | 0.0023 | 0.0404 | 0.0075 | 0.0051
31 | X | 0.0104 | 0.5898 | 0.0523 | 0.1557
32 | X | 0.1948 | 0.0965 | 0.1969 | 0.1382
Remark: observation 11 influences the No. of tweets and No. of photos uploaded coefficients, while
observation 31 influences the No. of tweets coefficient.
3) Cook's distance: measures the influence of the X outliers on all n fitted values; it is an aggregate
influence measure of the ith case on all n fitted values.
Guideline: if Di > F(0.50; p, n - p), observation i is considered influential. Here F(0.50; 4, 35) = 0.8556.
Observation No | Type of outlier | Cook's distance
3 | X | 0.00286
11 | X | 0.13717
15 | X | 0.00139
31 | X | 0.15318
32 | X | 0.01433
Remark: none of the observations is above the F value, so none is influential on all n fitted values.
From the three influence measures above, we can say that the X outliers are not especially influential.
f) The predictors are not highly correlated with each other.
Variance inflation factor
Guideline: if the mean VIF is much greater than 1, or any (VIF)k is much greater than 10, there is serious
multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 2.50902,
(VIF)2 = 1.04198, (VIF)3 = 2.55792.
None of the predictors has a VIF > 5, and the mean VIF of 2.0363 is not much bigger than 1, so we
conclude that serious multicollinearity is not a problem.
Selection of the best model
First, both models are significant. The R² of Model I is 0.3238 and the R² of Model II is 0.3602; adding
No. of tweets (X1) in Model II does not increase R² by much. The residual vs log10 Ŷ plots of both models
show constant variance, but Model I shows more nearly constant variance. The normal probability plot of
Model I is straighter than that of Model II. The influence of the X outliers is greater in Model II than in
Model I. In the ANOVA for Model I, all predictors have p < α, so all predictors are significant, while in
Model II No. of tweets (X1) is not significant (p > α) and its |t*| is the smallest among the predictors. So
No. of tweets is not significant to add; it gives no additional information.
From this analysis, we selected Model I as the best model.
FINAL MULTIPLE LINEAR REGRESSION MODEL
After verifying the model assumptions and performing diagnostics, we take the model below as our final
model. We can predict the No. of followers from Years since they joined and No. of photos uploaded.
This means that a person's follower count depends mainly on how long the person has been on Twitter,
and secondly on the number of photos he or she has uploaded.
Fit of the model
log10 Ŷ = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No. of photos uploaded)
log10(No. of followers) increases by 0.11439 when Years since they joined increases by one year, with
No. of photos uploaded held constant, and by 0.00006297 when the No. of photos uploaded increases by
one, with Years since they joined held constant.
Since the VIF values for Years since they joined and No. of photos uploaded are less than 5 and the mean
VIF of 1.04175 is only slightly bigger than 1, we conclude that there is no severe multicollinearity
problem.
From the table, the p value for Years since they joined is less than 0.05, so this predictor is significant.
The p value for No. of photos uploaded is slightly greater than 0.05, so this predictor is marginally
significant; if we change α to 0.10, No. of photos uploaded becomes significant.
F test: to check whether the regression is significant.
H0: β1 = β2 = 0
H1: not both β1 and β2 are 0
Decision rule: reject H0 if F* > F(1 - α; p - 1, n - p), with α = 0.05.
F* = 8.62 and F(0.95; 2, 36) = 3.2594. Since 8.62 > 3.2594, we reject H0 and conclude that β1 and β2 are
not both zero. In other words, the regression is significant.
Coefficient of multiple determination: R² = 0.3238, the fraction of the variability in No. of followers
explained by the model with Years since they joined and No. of photos uploaded.
Joint confidence intervals for the parameters
The Bonferroni joint confidence intervals can be used to estimate regression coefficients simultaneously.
With g parameters to be estimated jointly (g ≤ p), the confidence limits are
bk ± B·s{bk}, where B = t(1 - α/(2g); n - p).
Now,
b0 = 0.66796, b1 = 0.11439, b2 = 0.00006297
s{b0} = sqrt(0.0581118) = 0.241063
s{b1} = sqrt(0.0012516) = 0.035378
s{b2} = sqrt(1.0925E-9) = 3.30454E-5
B = 2.47887
C.I. for β0: (0.07038, 1.26554); for β1: (0.026712, 0.202068); for β2: (-1.89E-05, 1.45E-04).
We are 95% confident that β0, β1 and β2 lie in these intervals simultaneously.
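A sketch of these Bonferroni joint intervals for the final model's g = 3 coefficients, with B = t(1 - α/(2g); n - p), continuing the earlier Python sketches:

```python
from scipy import stats
import statsmodels.formula.api as smf

final = smf.ols("log10_followers ~ years_joined + photos", data=df).fit()

g, alpha = 3, 0.05
B = stats.t.ppf(1 - alpha / (2 * g), int(final.df_resid))  # cf. 2.47887 (df = 36)
for name, b, s in zip(final.params.index, final.params, final.bse):
    print(name, b - B * s, b + B * s)
```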
Confidence interval, confidence band and prediction interval at an xh of interest
xhᵀ = (1, 6.7, 910)
The value h(new,new) is smaller than the largest hii, so there is no extrapolation and we can continue
with these xh values. Note that the intervals below are on the log10 scale of the response.
Confidence interval at xh: (1.4398893, 1.5434771)
We are 95% confident that the mean log10(No. of followers) when Years since they joined is 6.7 and
No. of photos uploaded is 910 lies in this interval; back-transforming, the mean No. of followers lies
between about 10^1.4399 ≈ 27.5 million and 10^1.5435 ≈ 35.0 million.
Prediction interval at xh: (1.176267, 1.8070995)
We are 95% confident that a person with 6.7 years since joining and 910 photos uploaded will have a
log10(No. of followers) in this interval, i.e. between about 10^1.1763 ≈ 15.0 million and
10^1.8071 ≈ 64.1 million followers.
Confidence band at xh: (1.4167958, 1.5665706)
We are 95% confident that the band contains the entire regression surface over all combinations of
values of the X variables.
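A sketch of these interval estimates at xh, reusing the `final` fit from the joint-interval sketch; `get_prediction` returns both the confidence interval for the mean response and the prediction interval for a new observation, on the log10 scale, and 10**limits back-transforms to millions of followers.

```python
import pandas as pd

xh = pd.DataFrame({"years_joined": [6.7], "photos": [910]})
pred = final.get_prediction(xh)

frame = pred.summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper"]])        # CI for the mean response
print(frame[["obs_ci_lower", "obs_ci_upper"]])          # prediction interval
print(10 ** frame[["mean_ci_lower", "mean_ci_upper"]])  # back to millions
```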
Final discussion
Here we performed a multiple linear regression analysis to see which predictors are helpful in predicting
the No. of followers. We started with a preliminary model with four predictors (No. of tweets, Years since
they joined, No. of photos uploaded and Following back) and checked the model assumptions. The
preliminary model had non-constant variance, so we applied the log transformation. We also checked the
multicollinearity between the predictors. After satisfying the assumptions with the transformed model,
we checked whether any interaction terms should be added; of the terms we tried, the Years since they
joined × Following back (X2X4) interaction looked the most helpful for predicting the number of
followers, and we added this standardized interaction term to the model. We then used model search
procedures to find the best model. Our best model predicts the number of followers from Years since they
joined and the number of photos and videos posted. This model satisfies the constant-variance
assumption, normality is okay, and there is no serious multicollinearity problem.
Determinants of health, dimensions of health, positive health and spectrum of...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

Multiple Linear Regression Model Predicts Twitter Followers

  • 1. Multiple Linear Regression Project Guided By: Dr. Chen The University of Texas at Arlington Japan Shah & Vishrut Mehta 5/5/16 Applied Linear Regression- IE 5318
  • 2. 1 | P a g e Project Proposal Problem statement: In this project of simple linear regression analysis, we are trying to determine the relationship between one response variable (Y) and four predictor variables (X1, X2, X3, X4). We want to determine how the different values of all the predictor variables affect the value of the response variable. And among all predictor variable, which variable shows the most linear relationship with the response. The variables: The variables are as follows  Response Variable (Y): Number of the followers of a person on the Twitter (in millions)  Predictor Variable (X1): Number of tweets posted by a particular person  Predictor Variable (X2): Number of years passed since that person has joined the twitter  Predictor Variable (X3): Number of photos and videos posted  Predictor Variable (X4): Number of people that person is following back The data collection process: We have used the official site of Forbes to find the first 100 most followed people on the twitter (http://www.forbes.com/sites/maddieberg/2015/06/29/twitters-most-followed-celebrities-retweets-dont- always-mean-dollars/#35671f137ef3) Also we are using twitter to collect the data for the response variable and each predictor variables. (www.twitter.com) We searched for these people on the twitter for their accounts and verified them as their original account by looking at the symbol , which is the symbol for the official account of persons on the twitter. We are using the first 40 most followed people amongst them. Why modeling this data set would be meaningful? As we all know how much social media is affecting the lives of people in today's world and it is constantly evolving with the new social media sites. It has revolutionized how we look at the things. While some people have made their way to success through their hard work and years of experience, the others have made it through social media which also includes the same amount of hard work. Amongst all of them today, the second most popular social media site is Twitter according to ebizmba.com (http://www.ebizmba.com/articles/social-networking-websites). It is necessary to understand on which factors the popularity of the people is driven. We are going to determine that among all these four predictor variables, which variable affects the popularity of the person the most.
  • 3. 2 | P a g e Matrix Scatterplot and Pairwise Correlation Coefficient.
In the presence of multicollinearity, the correlation coefficient r between two predictors still lies in the range -1 ≤ r ≤ +1. If r between two predictors is greater than zero, high values of one predictor occur with high values of the other and low values occur with low values; if r is less than zero, high values of one predictor occur with low values of the other. A high correlation coefficient between two predictors is undesirable because it indicates severe multicollinearity, which inflates the variances of the coefficient estimates and makes the estimates sensitive to changes in the model. We therefore want the correlation coefficients between predictors to be near zero. The scatterplot and correlation matrix are discussed below.

Y vs X plots
No of followers vs No of tweets (Y vs X1): positive upward trend; r = 0.14782, a low correlation.
No of followers vs Years since they joined (Y vs X2): positive upward trend; r = 0.51656, a moderate correlation.
No of followers vs No of photos uploaded (Y vs X3): positive trend; r = 0.32294, a low correlation.
No of followers vs Following back (Y vs X4): positive upward trend; r = 0.47663, a moderate correlation.

X vs X plots
No of tweets vs Years since they joined (X1 vs X2): positive upward trend; r = 0.14618, near zero. These two predictors are only weakly correlated, which is good.
No of tweets vs No of photos uploaded (X1 vs X3): strong positive upward trend; r = 0.77547. These two predictors are highly correlated, which is bad.
No of tweets vs Following back (X1 vs X4): positive upward trend; r = 0.18112, near zero. These two predictors are only weakly correlated, which is good.
Years since they joined vs No of photos uploaded (X2 vs X3): positive upward trend; r = 0.20020, near zero. These two predictors are only weakly correlated, which is good.
Years since they joined vs Following back (X2 vs X4): positive upward trend; r = 0.57997. These two predictors are moderately correlated, which is bad.
No of photos uploaded vs Following back (X3 vs X4): positive upward trend; r = 0.23780, near zero. These two predictors are only weakly correlated, which is good.

Overall, the matrix indicates a multicollinearity problem among the predictors.

Potential Complication
There is severe multicollinearity between X1 and X3 (r = 0.77547), and X2 and X4 are moderately correlated (r = 0.57997).

II PRELIMINARY MULTIPLE LINEAR REGRESSION MODEL
The general linear regression model is

Yi = β0 + β1Xi1 + β2Xi2 + … + βp-1Xi,p-1 + εi

where β0, β1, …, βp-1 are parameters, Xi1, …, Xi,p-1 are known constants, and the εi are independent N(0, σ²), i = 1, …, n.

The fitted model is

Ŷ (No of followers) = -25.40410 - 0.00044178(No of tweets) + 8.15496(Years since they joined) + 0.00877(No of photos uploaded) + 0.00003188(Following back)

(A sketch reproducing this fit appears after the assumption list below.)

Model Assumptions
For model adequacy, the following assumptions need to be satisfied:
a) The current MLR model form is reasonable.
b) The residuals have constant variance.
c) The residuals are normally distributed (not required, but desired).
d) The residuals are uncorrelated.
e) There are no outliers.
f) The predictors are not highly correlated with each other.
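The pairwise correlations shown earlier and this preliminary fit can be reproduced with a minimal sketch like the following. The file name and column names are assumptions for illustration, not part of the original project.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Hypothetical file/column names -- adjust to the actual data set.
df = pd.read_csv("twitter_followers.csv")
cols = ["followers", "tweets", "years_joined", "photos", "following"]

print(df[cols].corr().round(5))                  # pairwise correlation matrix
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10))
plt.show()                                       # matrix scatterplot

model = smf.ols("followers ~ tweets + years_joined + photos + following",
                data=df).fit()
print(model.params)                              # b0 .. b4 of the preliminary fit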
a) The current MLR model form is reasonable

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi

where Yi is the number of followers, β0, β1, β2, β3, β4 are the parameters, Xi1 is the number of tweets, Xi2 the years since they joined, Xi3 the number of photos uploaded, and Xi4 the number followed back.

The plots of the residuals against each predictor (No of tweets, Years since they joined, No of photos uploaded, Following back) show no curvature in any plot, so the MLR model form is reasonable.

b) The residuals have constant variance

Residual vs Ŷ: this plot helps us judge whether the residuals have constant variance. The plot shows a funnel shape, so the residuals have non-constant variance.

Modified Levene Test (test for constancy of the error variance)
F test
H0: the error variance is constant; H1: the error variance is not constant.
Decision rule: if p < α, reject H0, with α = 0.05.
From the table, p = 0.0006 < 0.05, so we reject H0. This is a strong conclusion: the error variance is not constant.

Two-sample t test
Because the variance is not constant, we read the unequal-variance row.
H0: the error variance is constant; H1: the error variance is not constant.
Decision rule: if p < α, reject H0, with α = 0.05.
From the table, p = 0.0482 < 0.05, so we reject H0 and conclude that the error variance is not constant.
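A minimal sketch of this modified Levene (Brown-Forsythe) test, reusing the model object from the earlier sketch: the residuals are split into low and high groups by fitted value, and scipy's levene with center="median" compares the absolute deviations from each group's median.

import numpy as np
from scipy import stats

resid, fitted = model.resid, model.fittedvalues
low = resid[fitted <= np.median(fitted)]        # low-fitted-value group
high = resid[fitted > np.median(fitted)]        # high-fitted-value group

stat, p = stats.levene(low, high, center="median")   # Brown-Forsythe form
print(p)    # p < 0.05 -> conclude the error variance is not constant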
c) The residuals are normally distributed

Normal probability plot: this plot of the residuals versus their expected normal scores helps us judge normality. The plot shows that the residuals are right-skewed, so the residuals are not normal.

Test for Normality
H0: normality is okay; H1: normality is violated.
Decision rule: if ρ̂ < c(α, n), reject H0, with α = 0.10.
From the table of critical values, c(α, n) = 0.977, and from the output ρ̂ = 0.92338 < 0.977. So we reject H0. This is a strong conclusion: normality is violated.

d) The residuals are uncorrelated
The data were not collected in time order, so a time plot is not relevant.

e) Outliers: there are no outliers.

f) The predictors are not highly correlated with each other.

Variance Inflation Factor: the VIF detects multicollinearity between the predictors. It measures how much the variance of an estimated regression coefficient is inflated compared with the case where the predictors are not linearly related, and it is found by regressing Xk on the other p - 2 predictors:

(VIF)k = 1 / (1 - Rk²)

where Rk² is the coefficient of multiple determination of that regression. Guideline: if the mean VIF is much greater than 1 or any (VIF)k is much greater than 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5.

(VIF)1 = 2.50905, (VIF)2 = 1.51656, (VIF)3 = 2.58140, (VIF)4 = 1.54274. The variances of the coefficients of No of tweets, Years since they joined, No of photos uploaded, and Following back are inflated 2.51, 1.52, 2.58, and 1.54 times, respectively, compared with uncorrelated predictors. None of the predictors has VIF > 5, and the mean VIF of 2.037 is not much bigger than 1, so serious multicollinearity is not a problem.

Transformation
The residuals showed non-constant variance and non-normality, so a transformation is needed to satisfy the model assumptions. Our main goal is to stabilize the variance, since no transformation is guaranteed to fix normality. We tried variance-stabilizing transformations from weakest to strongest: the weakest is the square-root transformation Y′ = √Y (λ = 0.5) and the strongest is Y′ = -1/Y (λ = -1). The square-root transformation did not satisfy the model assumptions, so we moved to the log transformation (λ = 0).
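The transformed fit can be reproduced along the lines below, again with the assumed names from the earlier sketches; log_followers and model_log are reused in later sketches.

import numpy as np
import statsmodels.formula.api as smf

df["log_followers"] = np.log10(df["followers"])      # lambda = 0 transformation
model_log = smf.ols("log_followers ~ tweets + years_joined + photos + following",
                    data=df).fit()
print(model_log.params)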
a) The current MLR model form is reasonable

log10 Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi

where log10 Yi is the log of the number of followers, β0, …, β4 are the parameters, Xi1 is the number of tweets, Xi2 the years since they joined, Xi3 the number of photos uploaded, and Xi4 the number followed back.

The plots of the residuals against each predictor (No of tweets, Years since they joined, No of photos uploaded, Following back) show no curvature in any of the plots, so the MLR model form is reasonable.

b) The residuals have constant variance

Residual vs log10 Ŷ: the plot shows no funnel shape, so the variance is constant.
Modified Levene test
F test: H0: the error variance is constant; H1: the error variance is not constant.
Decision rule: if p < α, reject H0, with α = 0.05.
From the table, p = 0.1910 > 0.05, so we fail to reject H0 and conclude that the variance is constant.

Two-sample t test
Because the variance is constant, we read the equal-variance row.
H0: the error variance is constant; H1: the error variance is not constant.
Decision rule: if p < α, reject H0, with α = 0.05.
From the table, p = 0.0936 > 0.05, so we fail to reject H0 and conclude that the variance is constant.
c) Normality plot and normality test
H0: normality is okay; H1: normality is violated.
Decision rule: if ρ̂ < c(α, n), reject H0, with α = 0.05.
From the table of critical values, c(α, n) = 0.972, and from the output ρ̂ = 0.974 > 0.972. So we fail to reject H0. This is a weak conclusion: normality is okay. At α = 0.10 the test would fail, but in multiple linear regression the normality assumption is desired rather than required, so we can move ahead with this result. The important assumption is constant variance, which the log transformation has achieved, so we stopped the transformations at the log.

d) The data were not collected in time order, so a time plot is not relevant.

e) Diagnostics

Outlying X observations
The hat matrix helps identify X outliers; its diagonal elements hii are between 0 and 1 and sum to p, the number of parameters. hii measures the distance between the X values of the ith case and the mean of the X values of all n cases, and in this context is called the leverage of case i. If hii is large, observation i is considered an X outlier with high leverage; a leverage value is considered large when hii > 2p/n. Here 2p/n = (2 × 5)/39 = 0.2564. Comparing against this cutoff, observations 3, 9, 11, 15, 31, and 32 exceed 0.2564 and are X outliers.
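A minimal sketch of these leverage diagnostics, and of the studentized deleted residuals used in the Y-outlier test that follows, assuming the model_log fit from the earlier sketch:

influence = model_log.get_influence()
h = influence.hat_matrix_diag                   # leverages h_ii
t_del = influence.resid_studentized_external    # RStudent (deleted residuals)

n = int(model_log.nobs)
p = int(model_log.df_model) + 1                 # parameters, intercept included
print([i + 1 for i in range(n) if h[i] > 2 * p / n])   # X outliers, 1-based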
Outlying Y observations
Y observations whose studentized deleted residuals are large in absolute value are considered Y outliers; the RStudent column gives the studentized deleted residuals. To judge the largest absolute value |ti|, we perform the Bonferroni outlier test.

Bonferroni outlier test
α = 0.10, n = 39, p = 5. The Bonferroni critical value is t(1 - α/2n, n - p - 1) = t(1 - 0.10/(2 × 39), 33) = t(0.9987, 33) = 3.25817. No observation exceeds 3.25817, so there are no Y outliers. Combining the two results, observations 3, 9, 11, 15, 31, and 32 are X outliers.

Influence
After finding the X outliers, the next step is to check whether they are influential, using three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.

1) DFFITS measures the influence of an outlier on its own fitted value Ŷ (No of followers). Guideline: |DFFITS| > 2√(p/n) is considered influential; 2√(5/39) = 0.71611.

Obs  Type  DFFITS
3    X     -1.3807
9    X     -0.3078
11   X     -0.6020
15   X     -0.0438
31   X     -0.6945
32   X      0.0242

Only observation 3 exceeds 0.71611 in absolute value, so it has influence on its fitted value of No of followers.

2) DFBETAS measures the influence of an X outlier on each regression coefficient (intercept, No of tweets, Years since they joined, No of photos uploaded, Following back). A large absolute value of DFBETAS is considered influential. Guideline: |DFBETAS| > 2/√n = 2/√39 = 0.3202.

Obs  Type  Intercept  X1       X2       X3       X4
3    X      0.0599     0.1398  -0.0490  -0.0977  -1.0565
9    X     -0.0173     0.0315   0.0061   0.0439  -0.2427
11   X      0.0519     0.2885  -0.0207  -0.5411   0.1529
15   X      0.0005    -0.0241   0.0021  -0.0033   0.0029
31   X      0.0824    -0.5061  -0.0539   0.1173   0.1629
32   X      0.0802    -0.0093  -0.0203   0.0124   0.0083

Observation 3 influences the Following back coefficient and observation 31 the No of tweets coefficient; observation 11 also exceeds the cutoff for the No of photos uploaded coefficient.
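These outlier and influence checks can be sketched as follows, with the Bonferroni critical value from scipy and the influence measures from the statsmodels influence object created above (Cook's distance is discussed next):

import numpy as np
from scipy import stats

alpha = 0.10
t_crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)    # Bonferroni cutoff
print(np.where(np.abs(t_del) > t_crit)[0] + 1)          # Y outliers, if any

dffits, _ = influence.dffits                   # per-case DFFITS values
dfbetas = influence.dfbetas                    # one column per coefficient
cooks_d, _ = influence.cooks_distance          # Cook's distance D_i

print(np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0] + 1)            # DFFITS flags
print(np.where((np.abs(dfbetas) > 2 / np.sqrt(n)).any(axis=1))[0] + 1) # DFBETAS flags
print(np.where(cooks_d > stats.f.ppf(0.50, p, n - p))[0] + 1)          # Cook flags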
3) Cook's distance measures the influence of an X outlier on all n fitted values; it is an aggregate influence measure Di of the ith case. Guideline: if Di > F(0.50, p, n - p), observation i is influential; F(0.50, 5, 34) = 0.8878.

Obs  Type  Cook's D
3    X     0.21800
9    X     0.06657
11   X     0.08801
15   X     0.00886
31   X     0.08420
32   X     0.00127

No observation exceeds the F value, so none is influential on all n fitted values of No of followers.

f) Variance Inflation Factor
Guideline: if the mean VIF is much greater than 1 or any (VIF)k is much greater than 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. Here (VIF)1 = 2.50905, (VIF)2 = 1.51656, (VIF)3 = 2.58140, (VIF)4 = 1.54274. None of the predictors has VIF > 5, and the mean VIF of 2.037 is not much bigger than 1, so serious multicollinearity is not a problem.
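A minimal sketch of the VIF computation, following (VIF)k = 1/(1 - Rk²); statsmodels' variance_inflation_factor regresses each column on the others (column names as assumed earlier):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["tweets", "years_joined", "photos", "following"]])
for k in range(1, X.shape[1]):                 # skip the intercept column
    print(X.columns[k], variance_inflation_factor(X.values, k))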
The preliminary model is

log10(No of followers) = 0.87514 - 0.00000570(No of tweets) + 0.08402(Years since they joined) + 0.00011227(No of photos uploaded) + 3.081619E-7(Following back)

The parameter estimates are b0 = 0.87514, b1 = -0.00000570, b2 = 0.08402, b3 = 0.00011227, b4 = 3.081619E-7. On the log10 scale, the number of followers decreases by 0.00000570 for each additional tweet; increases by 0.08402 for each additional year since joining; increases by 0.00011227 for each additional photo uploaded; and increases by 3.08E-7 for each additional account followed back, in each case holding the other predictors constant.

There are 39 observations, so the corrected total has n - 1 = 38 degrees of freedom. With 4 predictors the model has 4 degrees of freedom, and the error has n - p = 39 - 5 = 34 degrees of freedom (p = 5 parameters including the intercept).

Standard errors: the standard errors of the regression coefficients, used to construct confidence intervals.

Sums of squares: SST (total) = SSM (model) + SSE (error). SST is the total variation in the response; SSE is the unexplained variation in the number of followers (yi - ŷi), i.e. deviation from the model; SSM is the explained variation (ŷi - ȳ), i.e. variation due to the model.

Mean squares: each sum of squares divided by its degrees of freedom. The mean square error estimates the variance σ² of our model; MSE = 0.02251. Root MSE = 0.15004 is the estimate s of the parameter σ. The dependent mean is 1.51431, the mean of log10 Y.

F value: the test statistic for whether the regression is significant.
H0: β1 = β2 = β3 = β4 = 0; H1: not all of β1, β2, β3, β4 are 0.
Decision rule: if F* > F(1 - α, p - 1, n - p), reject H0, with α = 0.05.
F* = 5.41 and F(0.95, 4, 34) ≈ 2.65. Since 5.41 > 2.65, we reject H0 and conclude that not all of β1, β2, β3, β4 are zero; the regression is significant.

Coefficient of multiple determination: R² = 0.3891, the fraction of the variability in No of followers explained by No of tweets, Years since they joined, No of photos uploaded, and Following back.
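A minimal sketch of this overall F test, reading F* from the fitted model and the critical value from scipy:

from scipy import stats

F_star = model_log.fvalue                  # F* from the ANOVA table
F_crit = stats.f.ppf(0.95, 4, 34)          # F(1 - alpha, p - 1, n - p)
print(F_star, F_crit, F_star > F_crit)     # True -> regression is significant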
Significance of the predictors
The t value is the ratio of a parameter estimate to its standard error. The null hypothesis is that the regression coefficient is zero; if a predictor's coefficient is zero, it does not contribute significantly to the model, and the predictor with the smallest |t*| is a candidate to drop. The reported Pr > |t| values are two-sided, and examining each t statistic with its p value shows which predictors are significant. The p values for No of tweets (X1) and Following back (X4) are greater than α = 0.05, so neither is significant for predicting the number of followers. The p value for Years since they joined (X2) is almost exactly 0.05 and the p value for No of photos uploaded (X3) is below 0.05, so both of these predictors are significant.

III Exploration of Interaction Terms using Partial Regression Plots

A partial regression plot, also called an added-variable plot, shows the marginal role of an interaction term given that the other predictor variables are already in the model. From these plots we can judge which interactions would help predict the number of followers (Y): we look for a positive or negative trend, and if there is no trend, the interaction should not be added. The possible interaction terms are:

1) X1X2: No of tweets × Years since they joined
2) X1X3: No of tweets × No of photos uploaded
3) X1X4: No of tweets × Following back
4) X2X3: Years since they joined × No of photos uploaded
5) X2X4: Years since they joined × Following back
6) X3X4: No of photos uploaded × Following back

Here we regress the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X1X2 given X1, X2, X3, X4. The plot shows a slight negative trend, but the points form an essentially horizontal band, so this interaction does not contain additional information useful for predicting the number of followers.
Here we regress the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X1X3 given X1, X2, X3, X4. The plot shows a slight negative trend, but the points form an essentially horizontal band, so this interaction does not contain additional information useful for predicting the number of followers.

Here we regress the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X1X4 given X1, X2, X3, X4. The trend is positive, so this interaction may be helpful in predicting the number of followers.
Here we regress the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X2X3 given X1, X2, X3, X4. The plot shows no trend, so this interaction will not provide any additional information for predicting the number of followers.

Here we regress the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X2X4 given X1, X2, X3, X4. The trend is negative, so this interaction may be helpful in predicting the number of followers.
Here we regress the residuals of log10 Y given X1, X2, X3, X4 against the residuals of X3X4 given X1, X2, X3, X4. The trend is positive, so this interaction may be useful in predicting the number of followers.

From the interaction plots above, the X1X2 and X1X3 plots have only a slight negative trend, so those terms will not be helpful, and the X2X3 plot has no trend, so there is no need to add it to the model. Among X1X4, X2X4, and X3X4, the X2X4 plot shows the most pronounced trend around the regression line, so X2X4 is the one interaction term worth adding.

Correlation involving the added interaction term before and after standardization
Standardization is important for models containing interaction and polynomial terms, because standardizing the predictors helps reduce multicollinearity. A standardized variable is obtained by centering the variable to mean zero and scaling it to variance 1; centering the predictors is especially important for interaction terms. A sketch of the standardization step follows below.
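A minimal sketch of that step, reusing the assumed column names: each predictor is centered and scaled, and only then is the X2X4 interaction formed.

def standardize(s):
    return (s - s.mean()) / s.std()        # center to mean 0, scale to variance 1

df["z2"] = standardize(df["years_joined"])
df["z4"] = standardize(df["following"])
df["z2z4"] = df["z2"] * df["z4"]           # interaction on the standardized scale

print(df[["z2", "z4", "z2z4"]].corr().round(3))   # correlations should shrink

(Correlation matrices before and after standardization.)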
The results above show the effect of standardization: it reduces multicollinearity. In the post-standardization correlation matrix, the correlations between No of tweets, Years since they joined, No of photos uploaded, Following back, and the Years since they joined × Following back term have all decreased.

IV. Model Search
With 4 predictor variables there are p = 5 parameters and 2⁴ = 16 candidate subsets of the predictors. It is difficult to assess all of them individually, so we use three model search procedures: a) best subsets, b) backward deletion, c) stepwise selection.

a) Best subsets: three criteria are checked to find a potentially best model: 1) Cp, 2) AIC (Akaike's information criterion), 3) SBC (Schwarz Bayesian criterion). We first look at Ra² across model sizes; at some point Ra² levels off, and we discard models whose Ra² decreased from the previous step. When Cp = p (the number of parameters), the model has no bias.
Number in Model  Ra²     R²      Cp      AIC        SBC
1                0.2356  0.2557  8.7531  -141.5964  -138.2693
2                0.2863  0.3238  6.7459  -143.3425  -138.3518
3                0.3053  0.3602  6.6100  -143.4968  -136.8425
4                0.3173  0.3891  6.9077  -143.3032  -134.0953
5                0.3535  0.4386  6.0000  -144.5965  -134.6151

By these criteria, the five-term model (with the interaction) has the minimum Cp and AIC, but we do not consider it because of its severe multicollinearity problem. Among the remaining models, the three-predictor model has the minimum Cp and AIC, and the two-predictor model has the minimum SBC. From this method we selected the following models:

Model 1: log10 Y (No of followers) = β0 + β1(Years since they joined) + β2(No of photos uploaded)
Model 2: log10 Y (No of followers) = β0 + β1(No of tweets) + β2(Years since they joined) + β3(No of photos uploaded)
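A minimal sketch of an exhaustive best-subsets search over the four predictors, recording adjusted R², AIC, and BIC (SBC) for each candidate. Note that statsmodels' AIC/BIC differ from SAS's values by an additive constant, but they rank the models the same way.

from itertools import combinations
import statsmodels.formula.api as smf

predictors = ["tweets", "years_joined", "photos", "following"]
fits = []
for r in range(1, len(predictors) + 1):
    for subset in combinations(predictors, r):
        fit = smf.ols("log_followers ~ " + " + ".join(subset), data=df).fit()
        fits.append((subset, fit.rsquared_adj, fit.aic, fit.bic))

for subset, r2a, aic, bic in sorted(fits, key=lambda t: t[2]):   # sort by AIC
    print(subset, round(r2a, 4), round(aic, 2), round(bic, 2))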
b) Backward deletion
From backward deletion, we selected the following model:
Model 1: log10 Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)

c) Stepwise selection
From stepwise selection, we obtained the same model:
Model 1: log10 Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)

From the model search procedures above, we selected the following two models as our best candidates:
Model I: log10 Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)
Model II: log10 Y (No of followers) = β0 + β1(No of tweets) + β2(Years since they joined) + β3(No of photos uploaded)
V. Model Selection

Model I
log10 Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)

Checking the model assumptions
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + εi, where Yi is the number of followers, Xi1 the years since they joined, and Xi2 the number of photos uploaded.
The residual-versus-predictor plots show no curvature at all, so the MLR model form is reasonable.

b) The residuals have constant variance
Residual vs log10 Ŷ:
The plot shows no funnel shape, so the residuals have constant variance.

c) The residuals are normally distributed
Normal probability plot: the plot is close to a straight line, so normality is satisfied.

d) The residuals are uncorrelated: the data were not collected in time order.

e) Diagnostics
Outlying X observations
A leverage value hii is considered large when hii > 2p/n, and such cases are X outliers with high leverage. Here 2p/n = (2 × 3)/39 = 0.1538. Comparing against this cutoff, observations 3, 11, 15, and 32 exceed 0.1538 and are X outliers.

Outlying Y observations
Y observations whose studentized deleted residuals (the RStudent column) are large in absolute value are considered Y outliers; to judge the largest |ti| we perform the Bonferroni outlier test.

Bonferroni outlier test
α = 0.10, n = 39, p = 3. The critical value is t(1 - α/2n, n - p - 1) = t(1 - 0.10/(2 × 39), 35) = t(0.9987, 35) = 3.2431. Every observation is below 3.2431, so there are no Y outliers. Combining the two results, there are four X outliers in total.

Influence
After finding the X outliers, the next step is to check whether they are influential, using the three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.

1) DFFITS: influence of an outlier on its fitted value Ŷ (No of followers). Guideline: |DFFITS| > 2√(p/n) = 2√(3/39) = 0.5547 is considered influential.

Obs  Type  DFFITS
3    X     0.1763
11   X     0.2651
15   X     0.2602
32   X     0.4234

None of the observations exceeds 0.5547, so none influences its fitted value.

2) DFBETAS: influence of an outlier on the regression coefficients of the intercept, Years since they joined, and No of photos uploaded. Guideline: |DFBETAS| > 2/√n = 2/√39 = 0.3202 is considered influential.

Obs  Type  Intercept  X2      X3
3    X     0.1554     0.1554  0.0297
11   X     0.0162     0.0388  0.2518
15   X     0.0147     0.0345  0.2410
32   X     0.3771     0.3836  0.1965

Observations 3, 11, and 15 are not influential, while observation 32 is slightly influential on the intercept and on the coefficient of Years since they joined.

3) Cook's distance: aggregate influence of the ith case on all n fitted values. Guideline: Di > F(0.50, p, n - p); here F(0.50, 3, 36) = 0.80381 (p = 3 and n - p = 36 for this model).

Obs  Type  Cook's D
3    X     0.07852
11   X     0.02396
15   X     0.02301
32   X     0.06062

No observation is influential on the fitted values. From the three influence measures, none of the outliers is influential.

Variance Inflation
Guideline: if the mean VIF is much greater than 1 or any (VIF)k is much greater than 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 1.04175 and (VIF)2 = 1.04175. Neither predictor has VIF > 5, and the mean VIF of 1.04175 is not much bigger than 1, so serious multicollinearity is not a problem.

Model II
log10 Y (No of followers) = 0.67727 - 0.00000568(No of tweets) + 0.11367(Years since they joined) + 0.000118437(No of photos uploaded)

Verifying the model assumptions
a) The current MLR model form is reasonable
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi, where Yi is the number of followers, Xi1 the number of tweets, Xi2 the years since they joined, and Xi3 the number of photos uploaded.
The plots of the residuals against each predictor (No of tweets, Years since they joined, No of photos uploaded) show no curvature at all, so the MLR model form is reasonable.

b) The residuals have constant variance
Residual vs log10 Ŷ: there is no funnel shape, so the variance is constant.

c) The residuals are normally distributed
Normal probability plot:
The plot is not perfectly straight, but normality is okay.

d) The residuals are uncorrelated: the data were not collected in time order.

e) Diagnostics

Outlying X observations
A leverage value hii is considered large when hii > 2p/n, and such cases are X outliers with high leverage. Here 2p/n = (2 × 4)/39 = 0.2051. Comparing against this cutoff, observations 3, 11, 15, 31, and 32 exceed 0.2051 and are X outliers.
Outlying Y observations
Y observations whose studentized deleted residuals (the RStudent column) are large in absolute value are considered Y outliers; to judge the largest |ti| we perform the Bonferroni outlier test.

Bonferroni outlier test
α = 0.10, n = 39, p = 4. The critical value is t(1 - α/2n, n - p - 1) = t(1 - 0.10/(2 × 39), 34) = t(0.9987, 34) = 3.2504. No observation exceeds 3.2504, so there are no Y outliers. Combining the two results, there are five X outliers in total.

Influence
After finding the X outliers, the next step is to check whether they are influential, using the three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.

1) DFFITS: influence of an X outlier on its fitted value Ŷ (No of followers). Guideline: |DFFITS| > 2√(p/n) = 2√(4/39) = 0.6405 is considered influential.

Obs  Type  DFFITS
3    X     0.1054
11   X     0.7406
15   X     0.0734
31   X     0.7877
32   X     0.2364

Observations 11 and 31 exceed 0.6405, so these two observations influence their fitted values.

2) DFBETAS: influence of an X outlier on the regression coefficients of the intercept, No of tweets, Years since they joined, and No of photos uploaded. Guideline: |DFBETAS| > 2/√n = 2/√39 = 0.3202 is considered influential.

Obs  Type  Intercept  X1      X2      X3
3    X     0.0913     0.0612  0.0916  0.0236
11   X     0.0494     0.3676  0.0994  0.6727
15   X     0.0023     0.0404  0.0075  0.0051
31   X     0.0104     0.5898  0.0523  0.1557
32   X     0.1948     0.0965  0.1969  0.1382

Observation 11 influences the coefficients of No of tweets and No of photos uploaded, while observation 31 influences the coefficient of No of tweets.
3) Cook's distance: aggregate influence of the ith case on all n fitted values. Guideline: Di > F(0.50, p, n - p); here F(0.50, 4, 35) = 0.8556 (p = 4 and n - p = 35 for this model).

Obs  Type  Cook's D
3    X     0.00286
11   X     0.13717
15   X     0.00139
31   X     0.15318
32   X     0.01433

No observation is above the F value, so none is influential on all n fitted values. From the three influence measures, the X outliers are not especially influential.

f) The predictors are not highly correlated with each other.
Variance Inflation: guideline: if the mean VIF is much greater than 1 or any (VIF)k is much greater than 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 2.50902, (VIF)2 = 1.04198, (VIF)3 = 2.55792. None of the predictors has VIF > 5, and the mean VIF of 2.0363 is not much bigger than 1, so serious multicollinearity is not a problem.

Selection of the best model
Both models are significant. R² is 0.3238 for Model I and 0.3602 for Model II, so adding No of tweets (X1) in Model II does not increase R² by much. The residual plots for both models show constant variance, but Model I's variance is more clearly constant, and Model I's normal probability plot is straighter than Model II's. The X outliers are more influential in Model II than in Model I. In the output for Model I, every predictor has p < α, so all predictors are significant, while in Model II the p value for No of tweets (X1) exceeds α and its |t*| is the smallest among the predictors, so No of tweets is not significant and adds no additional information. From this analysis, we selected Model I as the best model.

FINAL MULTIPLE LINEAR REGRESSION MODEL
After verifying the model assumptions and performing the diagnostics, we take the model below as our final model. We can predict the number of followers from the years since joining and the number of photos uploaded: growth in followers depends primarily on how long a person has been on Twitter and secondarily on how many photos he or she has uploaded.
Fit of the model
log10 Ŷ = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)

On the log10 scale, the number of followers increases by 0.11439 for each additional year since joining, holding the number of photos uploaded constant (a multiplicative factor of about 10^0.11439 ≈ 1.30 on the original scale), and increases by 0.00006297 for each additional photo uploaded, holding the years since joining constant.

The VIF values for Years since they joined and No of photos uploaded are below 5, and the mean VIF of 1.04175 is only slightly bigger than 1, so there is no severe multicollinearity problem.

From the table, the p value for Years since they joined is less than 0.05, so this predictor is significant. The p value for No of photos uploaded is slightly greater than 0.05, so it is marginally significant; at α = 0.10 it becomes significant.

F test (is the regression significant?)
H0: β1 = β2 = 0; H1: not both β1, β2 are 0.
Decision rule: if F* > F(1 - α, p - 1, n - p), reject H0, with α = 0.05.
F* = 8.62 and F(0.95, 2, 36) = 3.2594. Since 8.62 > 3.2594, we reject H0 and conclude that not both β1 and β2 are zero; the regression is significant.

Coefficient of multiple determination: R² = 0.3238, the fraction of the variability in No of followers explained by the model with Years since they joined and No of photos uploaded.

Joint confidence intervals for the parameters
The Bonferroni joint confidence intervals estimate g ≤ p regression coefficients simultaneously. The confidence limits are bk ± B·s{bk}, where B = t(1 - α/2g; n - p).
Here g = 3 and α = 0.05, with b0 = 0.66796, b1 = 0.11439, b2 = 0.00006297 and estimated variances s²{b0} = 0.0581118, s²{b1} = 0.0012516, s²{b2} = 1.0925E-9, so

s{b0} = √0.0581118 = 0.241063, s{b1} = √0.0012516 = 0.03537, s{b2} = √1.0925E-9 = 3.30454E-5, and B = t(1 - 0.05/6; 36) = 2.47887.

The joint intervals are β0: (0.07039, 1.26553), β1: (0.026712, 0.202068), β2: (-1.89E-5, 1.45E-4). We are 95% confident that β0 is in (0.07039, 1.26553), β1 is in (0.026712, 0.202068), and β2 is in (-1.89E-5, 1.45E-4) simultaneously.

C.I., P.I., and confidence band at an xh of interest
xhᵀ = (1, 6.7, 910). Since h_new,new is smaller than the largest hii, there is no extrapolation, and we can continue with these xh values.

Confidence interval at xh: (1.4398893, 1.5434771) on the log10 scale. We are 95% confident that the mean log10 number of followers when the years since joining is 6.7 and the number of photos uploaded is 910 lies in this interval; back-transforming, the mean number of followers is between about 10^1.4399 ≈ 27.5 and 10^1.5435 ≈ 35.0 million.

Prediction interval at xh: (1.176267, 1.8070995) on the log10 scale. We are 95% confident that a person with 6.7 years on Twitter and 910 photos uploaded will have between about 10^1.1763 ≈ 15.0 and 10^1.8071 ≈ 64.1 million followers.

Confidence band at xh: (1.4167958, 1.5665706) on the log10 scale. We are 95% confident that this region contains the entire regression surface over all combinations of values of the X variables.
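A minimal sketch of these joint intervals and of the interval estimates at xh, assuming model_final is the fitted two-predictor log10 model (an assumed name, e.g. from smf.ols("log_followers ~ years_joined + photos", data=df).fit()):

import pandas as pd
from scipy import stats

g, alpha = 3, 0.05
n = int(model_final.nobs)
p = int(model_final.df_model) + 1
B = stats.t.ppf(1 - alpha / (2 * g), n - p)        # B = 2.47887 for n=39, p=3

for bk, se in zip(model_final.params, model_final.bse):
    print(round(bk - B * se, 6), round(bk + B * se, 6))   # joint Bonferroni limits

xh = pd.DataFrame({"years_joined": [6.7], "photos": [910]})
pred = model_final.get_prediction(xh)
print(pred.summary_frame(alpha=0.05))              # mean C.I. and prediction P.I.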
Final discussion
We performed a multiple linear regression analysis to see which predictors are helpful in predicting the number of followers. We started with a preliminary model with four predictors (No of tweets, Years since they joined, No of photos uploaded, and Following back) and checked the model assumptions. The preliminary model had non-constant variance, which we removed with a log transformation, and we also checked the multicollinearity between the predictors. After satisfying the assumptions with the transformed model, we examined whether any interaction terms should be added; among the candidates, the Years since they joined × Following back (X2X4) interaction looked the most helpful, so we added it to the model in standardized form. We then used the model search procedures to find the best model. Our best model is

log10(Number of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(Number of photos and videos posted)

This model satisfies the constant-variance assumption, normality is okay, and there is no serious multicollinearity problem.