STATISTICS FOR DATA ANALYTICS
REPORT ON
Multiple Regression and Logistic Regression
By: Abhishek Dahale
X17170311
National College of Ireland
Contents
Multiple Regression Analysis
Introduction
Data Source
Objective
Data Information
Assumptions
Interpreting Output for Multiple Regression
Result
Logistic Regression
Data Source
Objective
Data Information
Software
Assumptions
Execution of Logistic Regression Using R
Interpretation
References
Multiple Regression Analysis
Introduction:
Multiple regression fits a line of best fit to the data. It is carried out to predict the
value of a dependent variable from two or more independent variables, and it
explains the contribution of each independent variable to the total variance of the
dependent variable [2].
Data Source:
Multiple regression has been done on the River water quality. The data source
is as follows:
https://data.gov.uk/dataset/c6ad19a1-360d-4e62-9fd3-26f9e3ac39dd/river-water-quality-annual-average-concentrations-of-selected-determinands/datafile/caf6ab6a-a2a8-4428-a209-5bd85a952078/preview
Objective:
As water is vital for all forms of life, water quality plays an important role. The
objectives of this analysis are:
1. To study the river water quality.
2. To study how alkalinity and temperature affect the pH of water.
3. To understand the relation between the independent and dependent
variables.
Multiple regression is carried out using a model of the form:
Y = a + b1*x1 + b2*x2 + ... + bp*xp
Data Information:
This data set involves the following components, which play an important role in
identifying river water quality:
• pH (potential of Hydrogen)
• Alkalinity
• Temperature
The quality of water depends on its pH value. The pH scale runs from 1 to 14, with
7 as the neutral point. Values below 7 indicate acidity (the lower the value, the
more acidic, with 1 the most acidic); values above 7 indicate alkalinity, with 14 the
most alkaline. Here we have taken pH as the dependent variable, which depends on
temperature and alkalinity.
Data Cleanup:
Data was extracted from the above-mentioned source and cleaned using R.
Unwanted columns and rows were removed from the data source. The code is
attached below:
setwd("E:/NCI/sem1/STATS")
WaterQualityData <- read.csv("E:/NCI/sem1/STATS/WaterQuality.csv",
header=T, na.strings=c(""), stringsAsFactors = T)
# Keep only the columns needed for the analysis
WaterQualityData <- WaterQualityData[,c("pH","Temp","Alkaline")]
# Count missing values per column
sapply(WaterQualityData, function(x) sum(is.na(x)))
# Drop rows with missing Alkaline or Temp values
WaterQualityData <- WaterQualityData[!is.na(WaterQualityData$Alkaline), ]
WaterQualityData <- WaterQualityData[!is.na(WaterQualityData$Temp), ]
write.csv(WaterQualityData, "WaterQualityDatCleaneddata.csv")
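The regression model itself can then be fitted with lm(). The snippet below is a
minimal sketch of that step: only the column names (pH, Temp, Alkaline) are taken
from the report, and a small synthetic data frame stands in for the actual cleaned
water-quality file.

```r
# Synthetic stand-in for the cleaned data; only the column names
# (pH, Temp, Alkaline) come from the report.
set.seed(42)
n <- 100
WaterQualityData <- data.frame(
  Temp     = runif(n, 5, 25),
  Alkaline = runif(n, 20, 250)
)
WaterQualityData$pH <- 7.1 - 0.003 * WaterQualityData$Temp +
  0.005 * WaterQualityData$Alkaline + rnorm(n, sd = 0.1)

# Fit pH on the two predictors and inspect the output
model <- lm(pH ~ Temp + Alkaline, data = WaterQualityData)
summary(model)
coef(model)  # intercept and the two slope estimates
```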
Assumptions:
Assumption 1:
River water quality is estimated via pH, which is measured on a continuous scale.
For performing multiple regression analysis the dependent variable needs to be
continuous, so our data set supports multiple regression.
Assumption 2:
The second assumption is that we must have two or more independent variables,
which should be continuous or categorical. Supporting this, we have two
independent variables, viz. Temp and Alkalinity.
Assumption 3:
The third assumption is that our data should show independence of residuals,
which can be checked using the Durbin-Watson statistic.
Below is the model summary for Durbin-Watson.
The Durbin-Watson statistic should ideally lie between 1.5 and 2.5; here d = 0.433,
which falls below that range and therefore suggests some positive autocorrelation
among the residuals.
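As a sketch of how this statistic is computed (the reported value of d = 0.433 is
taken as given), the Durbin-Watson statistic is the ratio of the sum of squared
differences of successive residuals to the sum of squared residuals. The snippet
below computes it by hand on illustrative residuals; in practice the
durbinWatsonTest() function from the car package reports the same quantity.

```r
# Durbin-Watson statistic by hand on illustrative residuals:
# d = sum((e_t - e_{t-1})^2) / sum(e_t^2); d always lies between 0 and 4,
# and independent residuals give a value near 2.
set.seed(1)
res <- rnorm(200)  # independent residuals, so d should be near 2
d <- sum(diff(res)^2) / sum(res^2)
d
```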
Assumption 4:
For this assumption, we need to analyse the relationship of each independent
variable with the dependent variable. There are a number of ways to check for a
linear relationship; here we have used scatterplots. We can see that, for our
dependent variable pH and the two independent variables temperature and
alkalinity, there exists a linear relationship. Therefore we are going to consider
these two independent variables in our regression.
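A quick numerical companion to the scatterplots is the Pearson correlation between
each independent variable and pH: a clearly non-zero correlation supports a linear
relationship. The sketch below uses synthetic data with a built-in linear trend in
place of the actual water-quality file.

```r
# Synthetic stand-in data with a linear relationship by construction
set.seed(7)
Temp     <- runif(100, 5, 25)
Alkaline <- runif(100, 20, 250)
pH <- 7.1 - 0.003 * Temp + 0.005 * Alkaline + rnorm(100, sd = 0.1)

# Scatterplot as in the report, plus correlations as a numeric check
plot(Alkaline, pH)   # shows an upward linear trend here
cor(Alkaline, pH)    # clearly positive by construction
cor(Temp, pH)
```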
Assumption 5:
According to the fifth assumption, our data must show homoscedasticity, i.e. the
variance of the residuals must be similar along the line of best fit. The Normal P-P
plot of the regression standardized residuals for the dependent variable supports
this assumption, so according to the graph below the chance of error in our dataset
is negligible.
Assumption 6:
According to assumption 6, our data set must not show multicollinearity: there
must not be a strong relation between our independent variables, i.e. Temp and
Alkaline. This can be analysed using the table below. Based on the collinearity
statistics in the coefficients output, we have VIF = 1.263 for both independent
variables. As this value lies between 1 and 10, there are no symptoms of
collinearity in our data.
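The VIF for each predictor can be reproduced by regressing that predictor on the
remaining ones and using VIF = 1 / (1 - R²); with only two predictors both VIFs
coincide, which matches the single value of 1.263 reported above. A hedged sketch
on synthetic data (car::vif() automates this in practice):

```r
# VIF by hand: regress one predictor on the other and apply
# VIF = 1 / (1 - R^2). With two predictors both VIFs are identical.
set.seed(3)
Temp     <- runif(100, 5, 25)
Alkaline <- 50 + 2 * Temp + rnorm(100, sd = 20)  # mildly correlated by design

r2  <- summary(lm(Temp ~ Alkaline))$r.squared
vif <- 1 / (1 - r2)
vif  # values between 1 and 10 are usually taken as unproblematic
```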
Assumption 7:
According to this assumption, our data set must not contain any outliers, since
outliers can distort the regression results and have a negative effect on predicting
the dependent variable from the independent variables. From the histogram below
we can see that our dataset does not contain any outliers.
Assumption 8:
This assumption requires us to check that the residuals are normally distributed.
Here the Normal P-P plot of the standardized residuals for the dependent variable
pH shows that the residuals are normally distributed [1].
Interpreting Output for Multiple Regression:
1. Evaluating how well the model fits:
This table provides the value R, the multiple correlation coefficient, which
indicates how well the model predicts our dependent variable pH. Here, in our
case, R = 0.711, which represents a good level of prediction.
2. Estimated model coefficients:
The equation to estimate pH from temperature and alkalinity is:
predicted pH = 7.156 - 0.003*Temp + 0.005*Alkaline
which is taken from the coefficients table below.
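Applying this equation is straightforward. For example, for water at 10 °C with an
alkalinity of 100 (the coefficients come from the table; the input values are
illustrative):

```r
# Predicted pH from the fitted equation in the report
predict_pH <- function(temp, alkaline) {
  7.156 - 0.003 * temp + 0.005 * alkaline
}
predict_pH(10, 100)  # 7.156 - 0.03 + 0.5 = 7.626
```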
3. Statistical Significance:
We can check whether the overall regression model is a good fit for the data using
the F-ratio in the ANOVA table. Here F(2, 502) = 256.240, p < .0005, which shows
that the independent variables statistically significantly predict the dependent
variable, i.e. the model fits the data well.
4. Statistical significance of the independent variables:
Here we check the statistical significance of temperature and alkalinity, our
independent variables, by testing whether each coefficient is equal to zero. For
Alkaline, t = 20.274 and sig = .000, so p < .05 and the coefficient is statistically
significant. For Temp, t = -0.297 and sig = .766, so p > .05 and its coefficient is
not statistically significant.
Result:
We performed a multiple regression analysis to predict the pH of river water from
temperature and alkalinity. The overall model was statistically significant,
F(2, 502) = 256.240, p < .0005, R2 = .505. Of the two predictors, alkalinity added
statistically significantly to the prediction (p < .05), while temperature did not.
LOGISTIC REGRESSION ANALYSIS
Data Source:
Logistic regression analysis was performed on a diabetes dataset. The data source
is as follows:
https://data.gov.uk/search?q=diabetes
Objective:
The main motivation behind choosing this dataset is to study the factors that play a
role in developing diabetes; the gender of the person plays an important role here.
The aims of this analysis are to:
1. Study which gender suffers more from diabetes.
2. Predict the age which is more prone to diabetes.
Data Information:
The dataset used here delineates information about diabetes-affected people and
involves factors such as gender, age, cholesterol, height and weight.
For performing logistic regression analysis, our dataset must contain a
dichotomous categorical variable. Here we have gender as the dichotomous
variable, where male is coded as 1 and female as 0. This coding was performed
using R. The variable view of the dataset is as follows:
Software:
I have used R for this analysis, as it is well suited to statistical analysis. The R code
for importing the data is as follows:
LogisticRegression <- read.csv("E:/NCI/sem1/STATS/diabetes.csv", header=T,
na.strings=c(""), stringsAsFactors = T)
Appropriate cleaning was done on the data: null values were removed from the
dataset and unused columns were dropped using the following R code.
# Drop rows with missing values in the relevant columns
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$glyhb), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$frame), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$height), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$chol), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$weight), ]
# Keep only the columns used in the analysis
LogisticRegression <- LogisticRegression[,c("chol","age","gender","height","weight")]
Assumptions:
Before analysing our data with logistic regression we need to make sure that the
data can actually be used for logistic regression [1]. This can be done by checking
the following assumptions; if our dataset passes them, we can use it for logistic
regression.
Assumption 1:
According to the first assumption, our dependent variable must be a dichotomous
categorical variable. Our dependent variable is gender, which was coded as
male = 1 and female = 0 using the following R code:
LogisticRegression$gender <- ifelse(LogisticRegression$gender == "male", 1, 0)
The first test of assumptions is passed, so we can consider this data for logistic
regression.
Assumption 2:
The second assumption states that we must have one or more independent
variables, which can be either continuous or categorical. Our data set includes the
continuous variables weight, age and height, and also one categorical variable.
Therefore we can consider this dataset for logistic regression.
Assumption 3:
For this assumption we must have independence of observations, and the
categories of the dichotomous variable must be mutually exclusive and exhaustive,
which holds for gender. The assumption is satisfied and we can perform logistic
regression on this dataset.
Assumption 4:
According to this assumption, our continuous independent variables weight, age,
chol and height must possess a linear relationship with the logit (log-odds) of the
dependent variable, gender [1].
Execution of Logistic Regression Using R:
Model Fitting:
For fitting the model we have used the following commands:
mylogit <- glm(gender ~ chol + age + height + weight,
               data = LogisticRegression, family = "binomial")
summary(mylogit)
Executing the above commands gives the model summary as follows:
Deviance Residuals:
The deviance residuals indicate how well the model fits each individual
observation. The table below shows their distribution:
    Min      1Q  Median      3Q     Max
-3.4319 -0.4721 -0.1757  0.4555  3.9251
The summary of the model coefficients is as follows:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -50.074467    5.078074   -9.861   < 2e-16 ***
chol         -0.002041    0.003540   -0.577  0.564163
age           0.037806    0.010689    3.537  0.000405 ***
height        0.753908    0.076356    9.874   < 2e-16 ***
weight       -0.009551    0.004100   -2.329  0.019844 *
This part shows the details of the coefficients: the estimates, standard errors,
z-values and p-values. In logistic regression each estimate is the change in the log
odds of the dependent variable for a one-unit change in the corresponding
predictor. Here, if chol (cholesterol), age, height or weight changes by one unit,
the log odds of the dependent variable gender change by -0.002041, 0.037806,
0.753908 or -0.009551 respectively.
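Because these estimates are on the log-odds scale, exponentiating them gives odds
ratios, which are often easier to interpret. The sketch below simply transforms the
estimates from the table above; it is not new model output.

```r
# Convert the reported log-odds coefficients to odds ratios
coefs <- c(chol = -0.002041, age = 0.037806,
           height = 0.753908, weight = -0.009551)
odds_ratios <- exp(coefs)
round(odds_ratios, 4)
# e.g. each additional year of age multiplies the odds of the outcome
# (gender coded 1) by roughly exp(0.0378), about 1.039
```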
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 507.07 on 372 degrees of freedom
Residual deviance: 259.62 on 368 degrees of freedom
AIC: 269.62
Number of Fisher Scoring iterations: 6
Interpretation:
R reports two deviances: null and residual.
The null deviance shows how well the response variable is predicted by a model
with only the intercept; the residual deviance shows how well it is predicted once
the independent variables are added.
The Akaike Information Criterion (AIC) can be used to judge the quality of the
model. We can use anova() to analyse the deviance:
anova(mylogit, test = "Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: gender
Terms added sequentially (first to last)
        Df  Deviance  Resid. Df  Resid. Dev  Pr(>Chi)
NULL                        372      507.07
chol     1     0.576        371      506.49   0.44807
age      1     2.409        370      504.08   0.12064
height   1   238.844        369      265.24   < 2e-16 ***
weight   1     5.622        368      259.62   0.01773 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(pscl)
pR2(mylogit)
         llh      llhNull           G2     McFadden         r2ML         r2CU
-129.8082839 -253.5334877  247.4504075    0.4880034    0.4849060    0.6524635
In logistic regression there is no exact equivalent of R2, so we can use McFadden's
pseudo-R2 to estimate model fit.
Evaluating the Predictive Ability of the Logistic Regression Model:
fitted.results <- predict(mylogit, newdata = subset(LogisticRegression,
select=c("chol","age","gender","height","weight")), type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != LogisticRegression$gender)
print(paste("Accuracy", 1 - misClasificError))
"Accuracy 0.849865951742627"
An accuracy of about 0.85 is a good result for this test, though note that it is
measured on the same data used to fit the model.
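Beyond a single accuracy number, a confusion matrix shows where the
misclassifications occur. A minimal sketch using short illustrative vectors (the real
inputs would be fitted.results and LogisticRegression$gender):

```r
# Confusion matrix from illustrative predictions vs. actual labels
predicted <- c(1, 0, 1, 1, 0, 0, 1, 0)
actual    <- c(1, 0, 0, 1, 0, 1, 1, 0)
confusion <- table(Predicted = predicted, Actual = actual)
confusion
# Accuracy is the proportion on the diagonal (correct classifications)
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # 6 of 8 correct -> 0.75
```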
library(ROCR)
p <- predict(mylogit, newdata = subset(LogisticRegression,
select=c("chol","age","gender","height","weight")), type = "response")
pr <- prediction(p, LogisticRegression$gender)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
[1] 0.9325298
The ROC curve below plots the true positive rate against the false positive rate.
From the ROC curve we obtain the AUC (area under the curve), which can be used
to estimate the performance of a binary classifier. As a rule of thumb, a model with
better predictive ability has an AUC closer to 1 than to 0.5. Here the AUC is about
0.93, close to 1, which means our model has good predictive ability.