6/2/2019 MATH1324 Introduction to Statistics Assignment 3 (5)
file:///C:/Users/soumya/Desktop/MATH1324.html#(5) 1/19
MATH1324 Introduction to Statistics
Assignment 3
Simple Linear Regression
Harsha Kumar(S3752953), Soumya Hiremath(S3746319)
June 2, 2019
RPubs link information
RPubs link: http://rpubs.com/Soumyash/501478
Introduction
Simple linear regression examines the relationship between two quantitative
variables by predicting the value of a dependent variable y, assuming a predictor
variable x provides information about it.
The steps involved are:
Import the dataset
Do exploratory analysis
Check outliers
Check correlation
Create a relationship model using the lm() function
Find the coefficients from the model created
Plot relationship graph
Problem Statement
Can years of experience of employees be used to predict the salary?
This can be achieved using simple linear regression, with which we predict salary
(Salary) by establishing a statistically significant linear relationship with years of
experience (YearsExperience), and correlation, which measures the strength of the
linear relationship between the two variables.
Data
The dataset used is open data from Kaggle.com, chosen from the link below:
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression/version/1
The dataset contains 30 observations with two variables:
1. YearsExperience: continuous numerical values
2. Salary: continuous numerical values
Exploratory Data Analysis
The data is loaded and evaluated using the str() function to check the data types
and values of the attributes. Missing values are often encountered when performing
data analysis; in our dataset there are none.
#Importing the dataset
Salary<- read.csv("C:/Users/soumya/Desktop/is/Salary_Data.csv")
#Evaluating dataset using str function
str(Salary)
## 'data.frame': 30 obs. of 2 variables:
## $ YearsExperience: num 1.1 1.3 1.5 2 2.2 2.9 3 3.2 3.2 3.7 ...
## $ Salary : num 39343 46205 37731 43525 39891 ...
#Checking missing values
print(Salary$Salary[is.na(Salary$Salary)])
## numeric(0)
print(Salary$YearsExperience[is.na(Salary$YearsExperience)])
## numeric(0)
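As an alternative (a sketch, not part of the original report), the missing values in both columns can be counted in a single call with colSums():

```r
# Count the NA values in every column of the data frame at once;
# a 0 for both YearsExperience and Salary confirms no missing values.
colSums(is.na(Salary))
```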
Descriptive Statistics and Visualisation
The tables below show statistical summaries of the variables in the dataset. The
box plots are used to visualise outliers for each variable. From the plots below,
we can see that no outliers were found.
Salary %>% summarise(Min = min(YearsExperience,na.rm = TRUE),
Q1 = quantile(YearsExperience,probs = .25,na.rm =TRUE),
Median = median(YearsExperience, na.rm = TRUE),
Q3 = quantile(YearsExperience,probs = .75,na.rm =TRUE),
Max = max(YearsExperience,na.rm = TRUE),
Mean = mean(YearsExperience, na.rm = TRUE),
SD = sd(YearsExperience, na.rm = TRUE),
n = n(),
Missing = sum(is.na(YearsExperience))) -> table1
knitr::kable(table1)
Min Q1 Median Q3 Max Mean SD n Missing
1.1 3.2 4.7 7.7 10.5 5.313333 2.837888 30 0
Salary %>% summarise(Min = min(Salary,na.rm = TRUE),
Q1 = quantile(Salary,probs = .25,na.rm =TRUE),
Median = median(Salary, na.rm = TRUE),
Q3 = quantile(Salary,probs = .75,na.rm =TRUE),
Max = max(Salary,na.rm = TRUE),
Mean = mean(Salary, na.rm = TRUE),
SD = sd(Salary, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Salary))) -> table2
knitr::kable(table2)
#Divide the graph area in 2 columns
par(mfrow=c(1, 2))
#Check outliers in YearsExperience
boxplot(Salary$YearsExperience, main="Experience Outliers",
sub=paste("Outlier rows: ", boxplot.stats(Salary$YearsExperience)$out))
#Check outliers in Salary
boxplot(Salary$Salary, main="Salary Outliers",
sub=paste("Outlier rows: ", boxplot.stats(Salary$Salary)$out))
Correlation
A Pearson correlation, r, was calculated to measure the strength of the linear
relationship between YearsExperience and Salary. Correlation can take values
between -1 and +1. The positive correlation was statistically significant, r = 0.978,
p < .001, 95% CI [0.954, 0.989]. This means that as YearsExperience increases,
Salary also increases. A scatter plot is used to visualise what the relationship
between the two variables looks like.
# correlation test
cor.test(Salary$Salary, Salary$YearsExperience)
##
## Pearson's product-moment correlation
##
## data: Salary$Salary and Salary$YearsExperience
## t = 24.95, df = 28, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9542949 0.9897078
## sample estimates:
## cor
## 0.9782416
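As a quick cross-check (an added sketch, not part of the original report): in simple linear regression, R-squared equals the square of Pearson's r, so r = 0.978 already anticipates the model fit reported in the hypothesis testing section:

```r
# In simple linear regression, R-squared is the square of Pearson's r.
r <- cor(Salary$Salary, Salary$YearsExperience)
r^2  # roughly 0.957, matching the Multiple R-squared from summary(model1)
```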
#scatter plot: predictor on the x-axis, response on the y-axis
plot(Salary ~ YearsExperience, data = Salary, xlab = "Years of Experience", ylab = "Salary")
# Visualize correlation
pairs(Salary)
Hypothesis Testing
Simple linear regression is performed by checking assumptions, interpreting the
important output, and testing the statistical hypotheses of a linear regression model.
The first step is to plot the relationship between the x and y variables,
YearsExperience and Salary, to determine whether linear regression is suitable. The
data exhibited a positive linear trend, so we proceed to fit the linear regression
model using the lm() function.
From the summary table, we can see the R-squared value is 0.957, which is close to 1.
There was statistically significant evidence that the data fit a linear regression
model.
The model summary also reports an F statistic which is used to test the overall
regression model. The F-test for the linear regression has the following statistical
hypotheses:
H0:The data do not fit the linear regression model
HA:The data fit the linear regression model
Under H0, that the data do not fit a linear model in the population, the F statistic,
reported in the summary as F = 622.5, follows an F distribution with df1 = 1 and
df2 = n - 2 = 30 - 2 = 28. Since the p-value is less than the 0.05 level of
significance, we reject H0. There is a statistically significant positive relationship
between YearsExperience and Salary; hence, the data fit the linear regression model.
model1 <- lm(Salary ~ YearsExperience, data = Salary)
model1 %>% summary()
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = Salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7958.0 -4088.5 -459.9 3372.6 11448.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.2 2273.1 11.35 5.51e-12 ***
## YearsExperience 9450.0 378.8 24.95 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
model1 %>% summary() %>% coef()
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.200 2273.0534 11.34694 5.511950e-12
## YearsExperience 9449.962 378.7546 24.95009 1.143068e-20
model1 %>% confint()
## 2.5 % 97.5 %
## (Intercept) 21136.061 30448.34
## YearsExperience 8674.119 10225.81
2*pt(q = 24.95,df = 30 - 2, lower.tail=FALSE)
## [1] 1.143184e-20
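The overall F-test p-value can be recovered the same way (a sketch mirroring the pt() calculation above); in simple linear regression F = t^2, so the F-test and the slope's t-test agree:

```r
# p-value for the overall F-test: F = 622.5 on df1 = 1 and df2 = n - 2 = 28.
# Because F = t^2 here (24.95^2 = 622.5), this equals the slope's two-sided t p-value.
pf(q = 622.5, df1 = 1, df2 = 28, lower.tail = FALSE)
```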
To test the statistical significance of the intercept/constant, we set the following
statistical hypotheses:
H0: α = 0
HA: α ≠ 0
The intercept/constant is reported as α = 25792.200, which represents the average
Salary when YearsExperience is equal to 0. This hypothesis is tested using a t
statistic, reported as t = 11.35, p < .001. The constant is statistically significant
at the 0.05 level, meaning there is statistically significant evidence that the
constant is not 0. R reports the 95% CI for α to be [21136.061, 30448.34].
H0: α = 0 is clearly not captured by this interval; hence it was rejected.
The hypothesis test for the slope, b, was as follows:
H0: β = 0
HA: β ≠ 0
The slope of the regression line was reported as b = 9449.962 and represents the
average increase in Salary for a one-unit increase in YearsExperience, a positive
change. This hypothesis is tested using a t statistic, reported as t = 24.95,
p < .001. R also reports the 95% CI for b to be [8674.119, 10225.81]. This interval
does not capture H0: β = 0, therefore H0 was rejected. There was statistically
significant evidence that YearsExperience was positively related to Salary.
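With the fitted model, the original research question of predicting salary from experience can be answered directly. A hedged sketch using a hypothetical employee (not in the original report); the point estimate follows from the reported coefficients, 25792.2 + 9450.0 × YearsExperience:

```r
# Predict the salary of a hypothetical employee with 5 years of experience,
# with a 95% prediction interval for an individual observation.
new_emp <- data.frame(YearsExperience = 5)
predict(model1, newdata = new_emp, interval = "prediction")
# Point estimate: 25792.2 + 9450.0 * 5, i.e. about 73042
```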
Hypothesis Testing Cont.
Before reporting the final regression model, we must validate all the following
assumptions for linear regression.
Independence
Linearity
Normality of residuals
Homoscedasticity
Residuals vs Fitted:
Independence is checked through the research design; we must ensure that
measurements between observations are independent. If the relationship between
fitted values and residuals is flat (look at the red line), this is a good indication
of a linear relationship.
Normal Q-Q:
We check the normal Q-Q plot to determine whether there are any gross deviations
from normality (e.g. obvious S shapes or non-linear trends). The Q-Q plot suggests
there are no major deviations from normality.
Scale-Location:
This plot is used to check homoscedasticity (the assumption of homogeneity of
variance). The red line is close to flat and the variance in the square root of the
standardised residuals is consistent across the fitted values. Hence,
homoscedasticity is supported.
Residuals vs Leverage:
This plot identifies cases that might unduly influence the fit of the regression
model, for example outliers. We look for values that fall beyond the red bands,
which are based on Cook’s distances. In the diagnostic plot there are no values
falling outside the bands, and therefore no evidence of influential cases.
ggplot(Salary, aes(x=YearsExperience, y=Salary)) +
geom_point(shape=19, colour="red") +
geom_smooth(method='lm', formula=y~x) +
labs(title="Salary and Years of Experience Regression") +
labs(x="Years of Experience") +
labs(y="Salary")
par(mfrow=c(2, 2))
plot(model1)
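The visual Q-Q check can also be complemented with a formal test (an added sketch, not part of the original report); a Shapiro-Wilk p-value above 0.05 is consistent with normally distributed residuals:

```r
# Formal test of the normality-of-residuals assumption;
# we fail to reject normality when the p-value is above 0.05.
shapiro.test(residuals(model1))
```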
Summary
The scatter plot demonstrated evidence of a positive linear relationship. Other
non-linear trends were ruled out.
The overall regression model was statistically significant, F(1, 28) = 622.5,
p < .001, and explained 95.7% of the variability in Salary, R-squared = 0.957.
Final inspection of the residuals supported normality and homoscedasticity.
Decision:
Overall model: Reject H0.
Intercept: Reject H0.
Slope: Reject H0.
Conclusion:
There was a statistically significant positive linear relationship between
YearsExperience and Salary.
Discussion
A linear regression model was fitted to predict the dependent variable, Salary, using
measures of YearsExperience as a single predictor.
Strengths:
In-depth model analysis supports accurate prediction.
Limitations:
The analysis is limited to employees of one firm.
The open data were published by an individual and may contain data entry errors.
Future investigations:
Is there any salary difference among different firms? Can we create regression
models for each firm?
Is there any other attribute, besides years of experience, that affects Salary?