STATISTICS FOR DATA ANALYTICS
REPORT ON
Multiple Regression and Logistic Regression
By: Abhishek Dahale
X17170311
National College of Ireland
Contents
Multiple Regression Analysis
Introduction
Data Source
Objective
Data Information
Assumptions
Interpreting Output For Multiple Regression
Result
Logistic Regression
Data Source
Objective
Data Information
Software
Assumptions
Execution Of Logistic Regression Using R
Interpretation
References
Multiple Regression Analysis
Introduction:
Multiple regression fits a linear model to data. It is used to predict the
value of a dependent variable from two or more independent variables, and it
shows how much each independent variable contributes to explaining the total
variance in the dependent variable [2].
Data Source:
The multiple regression was carried out on river water quality data. The data
source is as follows:
https://data.gov.uk/dataset/c6ad19a1-360d-4e62-9fd3-26f9e3ac39dd/river-water-quality-annual-average-concentrations-of-selected-determinands/datafile/caf6ab6a-a2a8-4428-a209-5bd85a952078/preview
Objective:
Water is vital for all living things, so water quality plays an important
role. The objectives of this analysis are:
1. To study river water quality.
2. To examine how alkalinity and temperature affect the pH of water.
3. To understand the relation between the independent and dependent
variables.
The multiple regression model has the form:
Y = a + b1*x1 + b2*x2 + ... + bp*xp
where Y is the dependent variable, a is the intercept, and b1 ... bp are the
coefficients of the independent variables x1 ... xp.
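As a minimal sketch of how a model of this form is fitted in R, the block below uses lm() on simulated data. The data are not the river measurements: the value ranges, the noise level and the coefficients used to generate them are assumptions chosen only so that the example runs on its own.

```r
# Simulated data for illustration only; these are NOT the river measurements.
set.seed(42)
n <- 500
Temp <- runif(n, min = 5, max = 20)         # hypothetical temperature range
Alkaline <- runif(n, min = 20, max = 250)   # hypothetical alkalinity range
pH <- 7.2 - 0.003 * Temp + 0.005 * Alkaline + rnorm(n, sd = 0.3)
sim <- data.frame(pH, Temp, Alkaline)

# Fit Y = a + b1*x1 + b2*x2 with pH as Y, Temp as x1 and Alkaline as x2
fit <- lm(pH ~ Temp + Alkaline, data = sim)
coefs <- coef(fit)   # a (intercept), b1 (Temp), b2 (Alkaline)
print(coefs)
```

The estimated coefficients should land near the values used to simulate the data; summary(fit) would additionally report R squared, the F-ratio and the t-tests discussed later in this report.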
Data Information:
This data set includes the following components, which play an important role
in assessing river water quality:
• pH (potential of Hydrogen)
• Alkalinity
• Temperature
The quality of water depends on its pH value. The pH scale conventionally
runs from 0 to 14, with 7 as the neutral point. Values below 7 are acidic,
and the lower the value the more acidic the water. Values above 7 are
alkaline, with 14 as the most alkaline. Here pH is treated as the dependent
variable, which depends on temperature and alkalinity.
Data Cleanup:
The data was extracted from the source mentioned above and cleaned using R.
Unwanted columns and rows with missing values were removed. The code is given
below:
# Set the working directory and load the raw data; empty strings are read as NA
setwd("E:/NCI/sem1/STATS")
WaterQualityData <- read.csv("E:/NCI/sem1/STATS/WaterQuality.csv",
                             header = TRUE, na.strings = c(""), stringsAsFactors = TRUE)
# Keep only the columns used in the regression
WaterQualityData <- WaterQualityData[, c("pH", "Temp", "Alkaline")]
# Count the missing values in each column
sapply(WaterQualityData, function(x) sum(is.na(x)))
# Drop rows with a missing Alkaline or Temp value
WaterQualityData <- WaterQualityData[!is.na(WaterQualityData$Alkaline), ]
WaterQualityData <- WaterQualityData[!is.na(WaterQualityData$Temp), ]
# Save the cleaned data
write.csv(WaterQualityData, "WaterQualityDatCleaneddata.csv")
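The row filtering above can also be written with complete.cases(). The sketch below shows this on a small toy data frame; the values are made up for illustration and only the column names are taken from the report.

```r
# Toy data frame standing in for the water quality file (values are made up)
toy <- data.frame(
  pH       = c(7.1, 7.8, 8.0, 7.4),
  Temp     = c(10.2, NA, 12.5, 9.8),
  Alkaline = c(110, 95, NA, 140)
)

# Missing values per column
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Keep only rows with no missing value in Temp or Alkaline
cleaned <- toy[complete.cases(toy[, c("Temp", "Alkaline")]), ]

print(na_counts)
print(nrow(cleaned))
```

This removes the same rows as filtering on each column in turn, in a single step.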
Assumptions:
Assumption 1:
River water quality is assessed here through pH, which is measured on a
continuous scale. Multiple regression requires a continuous dependent
variable, so the data set meets this requirement.
Assumption 2:
The second assumption is that there are two or more independent variables,
each either continuous or categorical. This data set has two independent
variables, Temp and Alkalinity.
Assumption 3:
The third assumption is independence of residuals, which can be checked with
the Durbin-Watson statistic.
Below is the Model Summary for Durbin-Watson.
As a rule of thumb, the Durbin-Watson statistic should lie between 1.5 and 2.5
for the residuals to be treated as free of first-order linear
autocorrelation. The value reported here is d = 0.433, which lies below 1.5
rather than inside that range. Taken at face value it points to positive
autocorrelation in the residuals, so the absence of autocorrelation is not
supported by the reported figure.
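For reference, the Durbin-Watson statistic is d = sum((e[t] - e[t-1])^2) / sum(e[t]^2), where e are the residuals in observation order. Values near 2 indicate no first-order autocorrelation, values toward 0 indicate positive autocorrelation and values toward 4 indicate negative autocorrelation. The sketch below computes it by hand on two made-up residual series; the function name durbin_watson and both series are illustrative assumptions, not part of the original analysis.

```r
# Durbin-Watson statistic from a vector of residuals in observation order
durbin_watson <- function(e) sum(diff(e)^2) / sum(e^2)

# Made-up residuals that drift slowly (positive autocorrelation)
e_smooth <- c(1, 2, 3, 2, 1, 0, -1, -2, -1, 0)
# Made-up residuals that flip sign every step (negative autocorrelation)
e_alternating <- c(1, -1, 1, -1)

d_smooth <- durbin_watson(e_smooth)
d_alternating <- durbin_watson(e_alternating)
print(d_smooth)
print(d_alternating)
```

For a fitted lm object called fit, the same function could be applied as durbin_watson(residuals(fit)).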
Assumption 4:
This assumption requires a linear relationship between the dependent variable
and each independent variable. There are a number of ways to check linearity;
here scatterplots were used. They show a linear relationship between the
dependent variable pH and each of the two independent variables, temperature
and alkalinity. Both independent variables are therefore kept in the
regression.
Assumption 5:
According to the 5th assumption, the data must show homoscedasticity, i.e.
the variance of the residuals must be similar along the line of best fit. The
Normal P-P Plot of Regression Standardized Residual for the dependent
variable, shown in the graph below, is taken here as support for this
assumption.
Assumption 6:
According to assumption 6, the data set must not show multicollinearity: the
independent variables, Temp and Alkaline, must not be highly correlated with
each other. This can be checked using the table below. In the Collinearity
Statistics of the Coefficients output, VIF = 1.263 for both independent
variables. As this value lies between 1 and 10, and is close to 1, there are
no symptoms of collinearity in the data.
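The variance inflation factor of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on the other predictors. With exactly two predictors this R2 is the squared correlation between them, which is why both variables share the same VIF. The sketch below works backwards from the reported VIF to the implied absolute correlation between Temp and Alkaline; the function name vif_two_predictors is an illustrative assumption.

```r
# VIF for a model with exactly two predictors, given their correlation r
vif_two_predictors <- function(r) 1 / (1 - r^2)

reported_vif <- 1.263

# Absolute correlation between the two predictors implied by the reported VIF
implied_r <- sqrt(1 - 1 / reported_vif)

print(round(implied_r, 3))
print(vif_two_predictors(implied_r))
```

An implied correlation of roughly 0.46 is moderate, which is consistent with the conclusion that collinearity is not a concern here.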
Assumption 7:
According to this assumption, the data set must not contain significant
outliers. Outliers can distort the regression results and so reduce the
accuracy of predicting the dependent variable from the independent variables.
The histogram below shows that the dataset does not contain any outliers.
Assumption 8:
The final assumption is that the residuals are normally distributed. The
Normal P-P plot of the standardized residuals for the dependent variable pH
shows that the residuals are approximately normally distributed. [1]
Interpreting Output For Multiple Regression:
1. Evaluating how well the model fits:
This table provides the value of R, the multiple correlation coefficient,
which measures how closely the values predicted by the model track the
observed values of the dependent variable pH. Here R = 0.711, which indicates
a good level of prediction.
2. Estimated model coefficients:
The equation for estimating pH from temperature and alkalinity is:
predicted pH = 7.156 - 0.003*Temp + 0.005*Alkaline
The coefficients are taken from the coefficient table below.
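The fitted equation can be wrapped in a small function to produce predictions. This is a sketch using the coefficients reported above; the function name predict_ph and the input values (Temp = 10, Alkaline = 100) are hypothetical and chosen only to show the arithmetic.

```r
# Predicted pH from the coefficients reported in the coefficient table
predict_ph <- function(temp, alkaline) {
  7.156 - 0.003 * temp + 0.005 * alkaline
}

# Hypothetical inputs: 7.156 - 0.003 * 10 + 0.005 * 100
example_ph <- predict_ph(temp = 10, alkaline = 100)
print(example_ph)
```

For these inputs the temperature term lowers the prediction by 0.03 and the alkalinity term raises it by 0.5.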
3. Statistical Significance:
The F-ratio in the ANOVA table tests whether the overall regression model is
a good fit for the data. Here F(2, 502) = 256.240, p < .0005, which shows
that the independent variables, taken together, statistically significantly
predict the dependent variable, i.e. the regression model is a good fit for
the data.
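The p-value for the reported F-ratio can be reproduced from the F distribution with pf(). The sketch below uses only the statistic and degrees of freedom quoted above.

```r
# Upper-tail probability of the reported F-ratio, F(2, 502) = 256.240
f_value <- 256.240
p_value <- pf(f_value, df1 = 2, df2 = 502, lower.tail = FALSE)

print(p_value)
print(p_value < 0.0005)
```

The result is far below .0005, in line with the significance level reported in the ANOVA table.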
4. Statistical Significance of independent variables:
Here we check the statistical significance of each independent variable,
temperature and alkalinity. The t-test for each variable tests whether its
coefficient is equal to zero. For Temp, t = -0.297 and sig = 0.766; for
Alkaline, t = 20.274 and sig = 0.00. Since p < 0.05 only for Alkaline, the
alkalinity coefficient is statistically significantly different from zero,
whereas the temperature coefficient is not.
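The significance values for the two t statistics can be checked with pt(). The sketch below assumes the t-tests use the residual degrees of freedom from the ANOVA, 502; the function name two_sided_p is illustrative.

```r
# Two-sided p-value for a t statistic
two_sided_p <- function(t_value, df) 2 * pt(-abs(t_value), df)

df_resid <- 502  # residual degrees of freedom, taken from F(2, 502)

p_temp <- two_sided_p(-0.297, df_resid)   # Temp
p_alk  <- two_sided_p(20.274, df_resid)   # Alkaline

print(round(p_temp, 3))
print(p_alk < 0.05)
```

The p-value for Temp comes out close to the reported 0.766, well above 0.05, while the p-value for Alkaline is far below 0.05.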
Result:
A multiple regression analysis was performed to predict the pH of river water
from temperature and alkalinity. Together these independent variables
statistically significantly predicted pH, F(2, 502) = 256.240, p < .0005,
R2 = .505. Of the two variables, only alkalinity added statistically
significantly to the prediction (p < .05); temperature did not (p = .766).
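The reported figures can be cross-checked against each other, because R2 is determined by the F-ratio and its degrees of freedom: R2 = (F * df1) / (F * df1 + df2). The sketch below applies this identity to the values quoted in this report.

```r
# R squared implied by the reported F-ratio and its degrees of freedom
f_value <- 256.240
df1 <- 2
df2 <- 502
r_squared <- (f_value * df1) / (f_value * df1 + df2)

# Multiple correlation coefficient implied by R squared
r_multiple <- sqrt(r_squared)

print(round(r_squared, 3))
print(round(r_multiple, 3))
```

Both values agree with the R2 = .505 and R = 0.711 reported in the model summary.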
LOGISTIC REGRESSION ANALYSIS
Data Source:
Logistic regression analysis was performed on diabetes data. The source of
the data is as follows:
https://data.gov.uk/search?q=diabetes
Objective:
The data was chosen to examine the factors that play a role in developing
diabetes; the gender of a person is treated as an important one.
The aims of this analysis are to:
1. Study which gender suffers from diabetes.
2. Predict the age group that is more prone to diabetes.
Data Information:
The dataset used for this analysis describes people affected by diabetes and
includes factors such as gender, age, cholesterol, height and weight.
Logistic regression requires a dichotomous categorical dependent variable.
Here gender is the dichotomous variable, with male coded as 1 and female
coded as 0. The coding was done in R. The variable view of the dataset is as
follows:
Software:
R was used for this analysis, as it is well suited to statistical analysis.
The R code for importing the data is as follows:
LogisticRegression <- read.csv("E:/NCI/sem1/STATS/diabetes.csv", header = TRUE,
                               na.strings = c(""), stringsAsFactors = TRUE)
The data was then cleaned: rows with null values were removed and unused
columns were dropped, using the following R code.
# Drop rows with missing values in the listed columns
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$glyhb), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$frame), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$height), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$chol), ]
LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$weight), ]
# Keep only the columns used in the model
LogisticRegression <- LogisticRegression[, c("chol", "age", "gender", "height", "weight")]
Assumptions:
Before analysing the data we need to make sure that it is actually suitable
for logistic regression [1]. This is done by checking the following
assumptions. If the dataset passes them, it can be used for logistic
regression.
Assumption 1:
According to the 1st assumption, the dependent variable must be a dichotomous
categorical variable. The dependent variable here is gender, which was coded
as male = 1 and female = 0 using the following R code:
LogisticRegression$gender <- ifelse(LogisticRegression$gender == "male", 1, 0)
The 1st assumption is therefore satisfied, and the data can be considered for
logistic regression.
Assumption 2:
The second assumption is that there are one or more independent variables,
each either continuous or categorical. This data set includes continuous
variables such as Weight, Age and Height, along with one categorical
variable, so it can be considered for logistic regression.
Assumption 3:
This assumption requires independence of observations, and the categories of
the dependent variable must be mutually exclusive and exhaustive. Here the
two categories of Gender are mutually exclusive and exhaustive, so the
assumption holds and logistic regression can be performed on this dataset.
Assumption 4:
According to this assumption, the continuous independent variables Weight,
Age, Chol and Height are taken to have a linear relationship with the logit
(log-odds) of the dependent variable, Gender. [1]
Execution Of Logistic Regression Using R:
Model Fitting:
The model was fitted with the following commands:
mylogit <- glm(gender ~ chol + age + height + weight,
               data = LogisticRegression, family = "binomial")
summary(mylogit)
Executing the above commands gives the following summary of the model:
Deviance Residuals:
Deviance residuals measure how well the model fits each observation. The
table below summarises the distribution of the deviance residuals across all
individuals.
    Min       1Q   Median       3Q      Max
-3.4319  -0.4721  -0.1757   0.4555   3.9251
The coefficients of the model are as follows:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -50.074467   5.078074  -9.861  < 2e-16 ***
chol         -0.002041   0.003540  -0.577 0.564163
age           0.037806   0.010689   3.537 0.000405 ***
height        0.753908   0.076356   9.874  < 2e-16 ***
weight       -0.009551   0.004100  -2.329 0.019844 *
This part of the output lists the coefficients: the estimates, standard
errors, z-values and p-values. Each estimate in a logistic regression is the
change in the log-odds of the outcome for a one-unit increase in the
corresponding predictor, holding the other predictors constant. Here, if chol
(cholesterol), age, height or weight increases by one unit, the log-odds of
the dependent variable Gender change by -0.002041, 0.037806, 0.753908 or
-0.009551 respectively.
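The estimates in the summary are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the four reported slope estimates; the vector name estimates is an illustrative assumption, and with a fitted model the same result would come from exp(coef(mylogit)).

```r
# Reported slope estimates (log-odds scale) from the model summary
estimates <- c(chol = -0.002041, age = 0.037806,
               height = 0.753908, weight = -0.009551)

# Odds ratios: multiplicative change in the odds of gender = 1 (male)
# for a one-unit increase in each predictor
odds_ratios <- exp(estimates)
print(round(odds_ratios, 3))
```

An odds ratio above 1 (age, height) means the odds of the outcome coded 1 rise with the predictor; a ratio below 1 (chol, weight) means they fall. For height the ratio is a little over 2, so each additional unit of height roughly doubles the odds.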
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 507.07 on 372 degrees of freedom
Residual deviance: 259.62 on 368 degrees of freedom
AIC: 269.62
Number of Fisher Scoring iterations: 6
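The figures in this part of the output are linked. For a binary (0/1) response the AIC equals the residual deviance plus twice the number of estimated coefficients, and the drop from null to residual deviance gives a likelihood ratio test of the whole model. The sketch below checks both using only the reported values; the variable names are illustrative.

```r
# Values reported in the model summary
null_dev  <- 507.07
resid_dev <- 259.62
null_df   <- 372
resid_df  <- 368
k <- 5  # estimated coefficients: intercept, chol, age, height, weight

# AIC for a 0/1 response: residual deviance + 2 * number of coefficients
aic <- resid_dev + 2 * k

# Likelihood ratio test of the full model against the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(aic)
print(lr_stat)
print(lr_p < 0.001)
```

The computed AIC matches the reported 269.62, and the deviance drop of 247.45 on 4 degrees of freedom is highly significant.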
Interpretation:
R reports two deviances: null and residual.
The null deviance shows how well the response variable is predicted by a
model that contains only the intercept.
The residual deviance shows how well it is predicted once the independent
variables are added.
The Akaike Information Criterion (AIC) can be used to compare the quality of
models.
The anova() function can be used to analyse the deviance.
anova(mylogit,test = "Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: gender
Terms added sequentially (first to last)
       Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                     372     507.07
chol    1    0.576       371     506.49  0.44807
age     1    2.409       370     504.08  0.12064
height  1  238.844       369     265.24  < 2e-16 ***
weight  1    5.622       368     259.62  0.01773 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(pscl)
pR2(mylogit)
         llh      llhNull           G2     McFadden         r2ML         r2CU
-129.8082839 -253.5334877  247.4504075    0.4880034    0.4849060    0.6524635
Logistic regression has no exact equivalent of the R2 of linear regression,
so McFadden's R2 (0.488 here) can be used to assess model fit.
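McFadden's R2 is defined as 1 - llh / llhNull, where llh is the log-likelihood of the fitted model and llhNull that of the intercept-only model, and G2 is -2 * (llhNull - llh). The sketch below recomputes both from the log-likelihoods reported by pR2(); the variable names are illustrative.

```r
# Log-likelihoods reported by pR2(mylogit)
llh      <- -129.8082839   # fitted model
llh_null <- -253.5334877   # intercept-only model

# McFadden's pseudo R squared and the likelihood ratio statistic G2
mcfadden <- 1 - llh / llh_null
g2 <- -2 * (llh_null - llh)

print(round(mcfadden, 7))
print(round(g2, 4))
```

Both results agree with the McFadden and G2 columns of the pR2() output, and G2 equals the drop from null to residual deviance.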
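Evaluating the predictive ability of the Logistic Regression Model:
# Predicted probabilities for the same observations used to fit the model
fitted.results <- predict(mylogit, newdata = subset(LogisticRegression,
    select = c("chol", "age", "gender", "height", "weight")), type = "response")
# Classify as 1 (male) when the predicted probability exceeds 0.5
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != LogisticRegression$gender)
print(paste("Accuracy", 1 - misClasificError))
"Accuracy 0.849865951742627"
An accuracy of about 0.85 is a good result, although it is measured on the
same data that was used to fit the model.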
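Accuracy alone hides which class the errors fall in, so a confusion matrix is a useful companion. The sketch below builds one with table() on made-up probabilities and labels; the vectors prob and actual are illustrative assumptions and are not taken from the diabetes data.

```r
# Made-up predicted probabilities and true classes, for illustration only
prob   <- c(0.92, 0.10, 0.65, 0.40, 0.81, 0.30, 0.55, 0.05)
actual <- c(1,    0,    1,    1,    1,    0,    0,    0)

# Same 0.5 cut-off as in the report
predicted <- ifelse(prob > 0.5, 1, 0)

# Rows are the true classes, columns the predicted classes
confusion <- table(Actual = actual, Predicted = predicted)
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

The diagonal holds the correct predictions and the off-diagonal cells hold the false positives and false negatives, so the accuracy is the diagonal sum divided by the total.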
library(ROCR)
# Predicted probabilities, then the ROC curve (true positive rate vs false positive rate)
p <- predict(mylogit, newdata = subset(LogisticRegression,
    select = c("chol", "age", "gender", "height", "weight")), type = "response")
pr <- prediction(p, LogisticRegression$gender)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
# Area under the ROC curve
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
[1] 0.9325298
The ROC plot shows the True Positive Rate plotted against the False Positive
Rate. The area under the ROC curve (AUC) can be used to assess the
performance of a binary classifier. As a rule of thumb, a model with good
predictive ability should have an AUC closer to 1 than to 0.5. Here the AUC
is 0.93, which is close to 1 and indicates that the model discriminates well
between the two classes.
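The AUC can also be read as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes it in base R from ranks, as an alternative to the ROCR call used in the report; the function name auc_rank and the four-observation example are illustrative assumptions.

```r
# AUC from ranks (equivalent to the Mann-Whitney U statistic scaled to 0..1)
auc_rank <- function(prob, actual) {
  r  <- rank(prob)               # average ranks are used for ties
  n1 <- sum(actual == 1)         # number of positives
  n0 <- sum(actual == 0)         # number of negatives
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up example: one of the four positive/negative pairs is ordered wrongly
prob   <- c(0.1, 0.4, 0.35, 0.8)
actual <- c(0,   0,   1,    1)

toy_auc <- auc_rank(prob, actual)
print(toy_auc)
```

In the toy example three of the four positive and negative pairs are ranked correctly, giving an AUC of 0.75; a value of 0.5 corresponds to random ordering and 1 to perfect separation.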
References
1. SPSS Survival Manual by Julie Pallant.
2. https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
3. https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/

More Related Content

Similar to Statistics for Data Analytics

Correation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R softwareCorreation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R softwareshrikrishna kesharwani
 
Essay on-data-analysis
Essay on-data-analysisEssay on-data-analysis
Essay on-data-analysisRaman Kannan
 
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATADETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATAIJCSEA Journal
 
QCP user manual EN.pdf
QCP user manual EN.pdfQCP user manual EN.pdf
QCP user manual EN.pdfEmerson Ceras
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionKhalid Aziz
 
Adjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORTAdjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORTSubhajit Sahu
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11Bonnie Green
 
Assignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docxAssignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docxjane3dyson92312
 
Assignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docxAssignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docxfestockton
 
manecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptxmanecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptxasdfg hjkl
 
Statistics for Data Analytics
Statistics for Data AnalyticsStatistics for Data Analytics
Statistics for Data AnalyticsTushar Dalvi
 
Eli plots visualizing innumerable number of correlations
Eli plots   visualizing innumerable number of correlationsEli plots   visualizing innumerable number of correlations
Eli plots visualizing innumerable number of correlationsLeonardo Auslender
 
Regression and Classification Analysis
Regression and Classification AnalysisRegression and Classification Analysis
Regression and Classification AnalysisYashIyengar
 
Project -- Second DeliverableIntroductionAfter reviewing the.docx
Project -- Second DeliverableIntroductionAfter reviewing the.docxProject -- Second DeliverableIntroductionAfter reviewing the.docx
Project -- Second DeliverableIntroductionAfter reviewing the.docxbriancrawford30935
 
What is water quality management
What is water quality managementWhat is water quality management
What is water quality managementselinasimpson311
 

Similar to Statistics for Data Analytics (20)

Correation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R softwareCorreation, Linear Regression and Multilinear Regression using R software
Correation, Linear Regression and Multilinear Regression using R software
 
Essay on-data-analysis
Essay on-data-analysisEssay on-data-analysis
Essay on-data-analysis
 
Doe introductionh
Doe introductionhDoe introductionh
Doe introductionh
 
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATADETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
 
QCP user manual EN.pdf
QCP user manual EN.pdfQCP user manual EN.pdf
QCP user manual EN.pdf
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Adjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORTAdjusting PageRank parameters and comparing results : REPORT
Adjusting PageRank parameters and comparing results : REPORT
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11
 
Assignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docxAssignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docx
 
Assignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docxAssignment - 03Model Building, Selection, & Prediction.docx
Assignment - 03Model Building, Selection, & Prediction.docx
 
manecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptxmanecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptx
 
Statistics for Data Analytics
Statistics for Data AnalyticsStatistics for Data Analytics
Statistics for Data Analytics
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
Eli plots visualizing innumerable number of correlations
Eli plots   visualizing innumerable number of correlationsEli plots   visualizing innumerable number of correlations
Eli plots visualizing innumerable number of correlations
 
Regression and Classification Analysis
Regression and Classification AnalysisRegression and Classification Analysis
Regression and Classification Analysis
 
ANCOVA in R
ANCOVA in RANCOVA in R
ANCOVA in R
 
Project -- Second DeliverableIntroductionAfter reviewing the.docx
Project -- Second DeliverableIntroductionAfter reviewing the.docxProject -- Second DeliverableIntroductionAfter reviewing the.docx
Project -- Second DeliverableIntroductionAfter reviewing the.docx
 
Regression
RegressionRegression
Regression
 
What is water quality management
What is water quality managementWhat is water quality management
What is water quality management
 
Time series project
Time series projectTime series project
Time series project
 

Recently uploaded

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Recently uploaded (20)

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

Statistics for Data Analytics

  • 1. STATISTICS FOR DATA ANALYTICS REPORT ON Multiple Regression and Logistic Regression By: Abhishek Dahale X17170311 National College of Ireland
  • 2. Contents Multiple Regression Analysis Introduction 3 Data Source 3 Objective 3 Data Information 3 Assumption 4 Interpreting Output For Multiple regression 9 Result 11 Logistic Regression Data Source 12 Objective 12 Data Information 12 Software 13 Assumptions 13 Execution Of Logistic Regression Using R 14 Interpretation 15 References
  • 3. Multiple Regression Analysis Introduction: Multiple regression creates a best fit line to data. Multiple regression is basically carried out to predict value of dependent variable based on two or more independent variables. It explains the contribution of each of the independent variables to predict the total variance on dependent variable[2] Data Source: Multiple regression has been done on the River water quality. The data source is as follows: https://data.gov.uk/dataset/c6ad19a1-360d-4e62-9fd3-26f9e3ac39dd/river- water-quality-annual-average-concentrations-of-selected- determinands/datafile/caf6ab6a-a2a8-4428-a209-5bd85a952078/preview Objective: As water is vital for all living forms of life, water quality plays an important role. The objective of this analysis is: 1. To study the river water quality. 2. How the Alkalinity, Temperature affects the pH of water. 3. To understand the relation between the independent and dependent variables. Multiple regression can have achieved using: Y=a+b1*x1+b2*x2+ ……+ bp*Xp Data Information: This data set involves the following components that plays important role in identifying the River Water Quality. • pH(potential of Hydrogen)
  • 4. • Alkalinity • Temperature The quality of water depends on its pH value. A pH scale consists of numbering from 1 to 14 ,of which the value 7 is the neutral point . Value below 7 , as it goes on decreasing the water becomes more acidic ,1 being most . Value above 7 indicates alkalinity, with 14 as most alkaline. Here we have considered pH as dependent variable which depends on temperature and alkalinity. Data Cleanup: Data was extracted from the above mentioned source and Data cleaning was done using R programming. Unwanted columns and rows were removed from the data source. Code for the same is attached below: setwd("E:/NCI/sem1/STATS") WaterQualityData <- read.csv("E:/NCI/sem1/STATS/WaterQuality.csv", header=T, na.strings=c(""), stringsAsFactors = T) WaterQualityData <- WaterQualityData[,c("pH","Temp","Alkaline")] sapply(WaterQualityData,function(x) sum(is.na(x))) WaterQualityData<WaterQualityData[!is.na(WaterQualityData$Alkaline), ] WaterQualityData <- WaterQualityData[!is.na(WaterQualityData$Temp), ] write.csv(WaterQualityData,"WaterQualityDatCleaneddata.csv") Assumptions: Assumptions 1 :
  • 5. River Water quality is estimated on pH which is measured on continuous scale. For perfroming Multiple regression analysis dependent variable needs to be continuous. Here our data set supports Multiple regression. Assumption 2 : Second assumption include that we must have two or more independent variables ,which should be continuous or categorical. Supporting this argument we have two 2 independent variables viz. Temp,Alkalinity. Assumption 3 : Third assumption includes that our data should support independence of residuals. We can perform this by using Durbin-Watson Statistics. Below is the Model Summary for Durbin-Waston. As we know Durbin-Watson Statistic Value should lie between 1.5 and 2.5 ,here our d=0.433 i.e. 1.5 < d < 2.5 .Therefore, we can conclude here that linear autocorrection is not present. Assumption 4: In this assumption ,There is a need to analyse relationship between each independent variable with the dependent variable. There are number of ways to check the linear relationship ,here we have considered Scatterplot to analyse .Here we can see that ,for our dependent variable pH and the two independent variables temperature and Alkalinity, there exists a linear relationship. Therefore we are going to consider these 2 independent variables in our regression.
  • 6. Assumption 5:
According to the fifth assumption, the data must show homoscedasticity, i.e. the variance of the residuals must be similar along the line of best fit. The Normal P-P Plot of Regression Standardized Residual for the dependent variable is used here to support this assumption. The graph below gives no indication of a serious violation.
  • 7. Assumption 6:
According to the sixth assumption, the data must not show multicollinearity: the independent variables, Temp and Alkaline, must not be strongly related to each other. This can be checked using the table below. In the Collinearity Statistics of the Coefficients output, VIF = 1.263 for both independent variables. As the value lies between 1 and 10, we conclude that there are no symptoms of collinearity in our data.
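As a rough sketch of where the VIF comes from: for predictor j, VIF = 1 / (1 - R²), where R² is obtained by regressing that predictor on the remaining predictors. With only two predictors both regressions share the same R² (the squared correlation), which is why both variables report the same VIF. The simulated `temp` and `alkaline` values below are hypothetical and stand in for the real data.

```r
# VIF computed by hand for a two-predictor model (simulated, hypothetical data).
set.seed(42)
n <- 500
temp <- rnorm(n, mean = 12, sd = 4)
alkaline <- 80 + 5 * temp + rnorm(n, sd = 40)

# Regress each predictor on the other and convert R^2 to VIF
r2_temp <- summary(lm(temp ~ alkaline))$r.squared
r2_alk  <- summary(lm(alkaline ~ temp))$r.squared
vif_temp <- 1 / (1 - r2_temp)
vif_alk  <- 1 / (1 - r2_alk)

# With two predictors the two VIF values coincide
print(c(vif_temp = vif_temp, vif_alk = vif_alk))

# Inverting the reported VIF of 1.263 gives the implied R^2 between the predictors
r2_implied <- 1 - 1 / 1.263
r_implied  <- sqrt(r2_implied)
print(c(r2_implied = r2_implied, r_implied = r_implied))
```

Under this reading, the reported VIF of 1.263 corresponds to a moderate correlation between temperature and alkalinity, well short of the level that would inflate the coefficient standard errors badly.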
  • 8. Assumption 7:
According to this assumption, the data set must not contain significant outliers. Outliers can distort the regression results and so reduce the accuracy with which the dependent variable is predicted from the independent variables. The histogram below shows that our dataset does not contain any outliers.
Assumption 8:
The final assumption is that the residuals are normally distributed. Here the Normal P-P plot of the standardized residuals for the dependent variable pH shows that the residuals are approximately normally distributed.[1]
  • 9. Interpreting Output For Multiple regression:
1. Evaluating how well the model fits:
This table provides the value of R, the multiple correlation coefficient, which measures the quality of the prediction of the dependent variable pH. Here R = 0.711, which indicates a good level of prediction.
  • 10. 2. Estimated model coefficients:
The equation to estimate pH from Temperature and Alkaline is:
    predicted pH = 7.156 - 0.003 * Temp + 0.005 * Alkaline
It is taken from the coefficient table below.
3. Statistical significance:
The F-ratio in the ANOVA table tests whether the overall regression model is a good fit for the data. Here F(2, 502) = 256.240, p < .0005, which shows that the independent variables statistically significantly predict the dependent variable.
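The fitted equation can be wrapped in a small function to see how each coefficient acts. This is only a sketch: the function name `predict_ph` and the input values (Temp = 10, Alkaline = 100) are hypothetical, and the inputs are assumed to be in the same units as the source data.

```r
# Fitted equation from the coefficient table, wrapped as a function.
# The input values used below are hypothetical, chosen only for illustration.
predict_ph <- function(temp, alkaline) {
  7.156 - 0.003 * temp + 0.005 * alkaline
}

# Predicted pH for one hypothetical sample
ph_example <- predict_ph(temp = 10, alkaline = 100)
print(ph_example)

# Effect of a one-unit increase in each predictor, holding the other fixed
effect_temp <- predict_ph(11, 100) - predict_ph(10, 100)   # equals the Temp coefficient
effect_alk  <- predict_ph(10, 101) - predict_ph(10, 100)   # equals the Alkaline coefficient
print(c(effect_temp = effect_temp, effect_alk = effect_alk))
```

Each coefficient is the change in predicted pH for a one-unit change in that predictor with the other held constant, which is what the two differences recover.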
  • 11. 4. Statistical significance of independent variables:
Here we check the statistical significance of Temperature and Alkaline, our independent variables. The test examines whether each coefficient is equal to zero in the population. Here t = -0.297 and sig = 0.766 for Temp, and t = 20.274 and sig = 0.000 for Alkaline. Since p < 0.05 only for Alkaline, its coefficient is statistically significantly different from zero, while the coefficient for Temp is not.
Result:
We performed a multiple regression analysis to predict the pH of river water from temperature and alkalinity. These independent variables statistically significantly predicted pH, F(2, 502) = 256.240, p < .0005, R² = .505. Alkalinity added statistically significantly to the prediction (p < .05), while temperature did not (p = .766).
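The reported figures can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only the values quoted in the text (R² = .505, F(2, 502) = 256.240, and the two t statistics); small differences are expected because the inputs are rounded.

```r
# Consistency check of the reported regression output (values taken from the text).
r2       <- 0.505   # reported R squared
k        <- 2       # number of predictors
df_resid <- 502     # residual degrees of freedom

# Sample size implied by the degrees of freedom
n_obs <- df_resid + k + 1

# F statistic implied by R squared: F = (R^2 / k) / ((1 - R^2) / df_resid)
f_from_r2 <- (r2 / k) / ((1 - r2) / df_resid)

# p-value of the reported overall F test
p_model <- pf(256.240, df1 = k, df2 = df_resid, lower.tail = FALSE)

# Two-sided p-values implied by the reported t statistics
p_temp <- 2 * pt(-abs(-0.297), df = df_resid)
p_alk  <- 2 * pt(-abs(20.274), df = df_resid)

print(c(n_obs = n_obs, f_from_r2 = f_from_r2))
print(c(p_model = p_model, p_temp = p_temp, p_alk = p_alk))
```

The F value recovered from R² lands close to the reported 256.240, and the p-value for Temp reproduces the reported 0.766, which supports reading Temp as non-significant and Alkaline as significant.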
  • 12. LOGISTIC REGRESSION ANALYSIS
Data Source:
Logistic regression analysis was performed on diabetes data. The source of the data is as follows:
https://data.gov.uk/search?q=diabetes
Objective:
The main objective behind choosing this data is to examine the factors that play a role in developing diabetes. Gender plays an important role in this. The aims of this analysis are to:
1. Study which gender suffers from diabetes.
2. Predict the age that is more prone to diabetes.
Data Information:
The dataset used for this analysis describes people affected by diabetes and includes factors such as gender, age, cholesterol, height and weight. Logistic regression requires a dichotomous categorical dependent variable. Here gender is the dichotomous variable, with male coded as 1 and female coded as 0. The coding was done in R. The variable view of the dataset is as follows:
Software:
I have used R for this analysis as it is a well-suited tool for statistical analysis. The R code for importing the data is as follows:
  • 13.
    LogisticRegression <- read.csv("E:/NCI/sem1/STATS/diabetes.csv", header = TRUE, na.strings = c(""), stringsAsFactors = TRUE)
The data was cleaned and rows with null values were removed from the dataset. Unused columns were also removed using R code.
    # Drop rows with missing values in the columns of interest
    LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$glyhb), ]
    LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$frame), ]
    LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$height), ]
    LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$chol), ]
    LogisticRegression <- LogisticRegression[!is.na(LogisticRegression$weight), ]
    # Keep only the columns used in the model
    LogisticRegression <- LogisticRegression[, c("chol", "age", "gender", "height", "weight")]
Assumptions:
Before analysing our data with logistic regression, we need to make sure that the data can actually be used for logistic regression[1]. This is done by checking the following assumptions. If our dataset passes them, we can use it to perform logistic regression.
Assumption 1:
According to the first assumption, the dependent variable must be a dichotomous categorical variable. Our dependent variable is gender, which was coded as male = 1 and female = 0 using the following R code:
    LogisticRegression$gender <- ifelse(LogisticRegression$gender == "male", 1, 0)
The first assumption is therefore satisfied, and we can consider this data for logistic regression.
Assumption 2:
The second assumption states that we must have one or more independent variables, which can be either continuous or categorical. Here our
  • 14. data set includes continuous variables such as Weight, Age and Height and also one categorical variable. Therefore we can consider this dataset for logistic regression.
Assumption 3:
This assumption requires independence of observations, and the categories of the dependent variable must be mutually exclusive and exhaustive. Here the categories of Gender are mutually exclusive and exhaustive, so the assumption holds and we can perform logistic regression using this dataset.
Assumption 4:
According to this assumption, the continuous independent variables Weight, Age, Chol and Height must have a linear relationship with the logit (log odds) of Gender, which is our dependent variable.[1]
Execution Of Logistic Regression Using R:
Model Fitting:
To fit the model we used the following commands:
    mylogit <- glm(gender ~ chol + age + height + weight, data = LogisticRegression, family = "binomial")
    summary(mylogit)
Executing the above commands gives the following summary of the model.
Deviance Residuals:
Deviance residuals are a measure of model fit. The table below summarises their distribution across the individual observations.
        Min       1Q   Median       3Q      Max
    -3.4319  -0.4721  -0.1757   0.4555   3.9251
  • 15. Summary of the model is as follows:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept) -50.074467   5.078074  -9.861  < 2e-16 ***
    chol         -0.002041   0.003540  -0.577 0.564163
    age           0.037806   0.010689   3.537 0.000405 ***
    height        0.753908   0.076356   9.874  < 2e-16 ***
    weight       -0.009551   0.004100  -2.329 0.019844 *
This part shows the details of the coefficients: the estimates, standard errors, z-values and p-values. Each logistic regression coefficient gives the change in the log odds of the outcome for a one-unit increase in the corresponding predictor variable, holding the other predictors constant. Here, if chol (cholesterol), age, height or weight increases by one unit, the log odds of the dependent variable Gender change by -0.002041, 0.037806, 0.753908 or -0.009551 respectively.
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 507.07 on 372 degrees of freedom
    Residual deviance: 259.62 on 368 degrees of freedom
    AIC: 269.62
    Number of Fisher Scoring iterations: 6
Interpretation:
R reports two deviances: null and residual. The null deviance shows how well the response variable is predicted by a model containing only the intercept. The residual deviance shows how well it is predicted once the independent variables are added. The Akaike Information Criterion (AIC) can be used to compare the quality of models. We can use anova() to analyse the deviance.
    anova(mylogit, test = "Chisq")
  • 16.
    Analysis of Deviance Table
    Model: binomial, link: logit
    Response: gender
    Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev Pr(>Chi)
    NULL                     372     507.07
    chol    1    0.576       371     506.49  0.44807
    age     1    2.409       370     504.08  0.12064
    height  1  238.844       369     265.24  < 2e-16 ***
    weight  1    5.622       368     259.62  0.01773 *
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    library(pscl)
    pR2(mylogit)
             llh      llhNull           G2     McFadden         r2ML         r2CU
    -129.8082839 -253.5334877  247.4504075    0.4880034    0.4849060    0.6524635
Logistic regression has no exact equivalent of the R² of linear regression, so McFadden's R² can be used to assess model fit.
Evaluating the predictive ability of the Logistic Regression Model:
    # Predicted probabilities for the observations used to fit the model
    fitted.results <- predict(mylogit, newdata = subset(LogisticRegression, select = c("chol", "age", "gender", "height", "weight")), type = "response")
    # Classify as 1 when the predicted probability exceeds 0.5
    fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
    misClasificError <- mean(fitted.results != LogisticRegression$gender)
  • 17.
    print(paste("Accuracy", 1 - misClasificError))
    "Accuracy 0.849865951742627"
An accuracy of about 0.85 is a good result, although it is measured on the same data that was used to fit the model.
    library(ROCR)
    p <- predict(mylogit, newdata = subset(LogisticRegression, select = c("chol", "age", "gender", "height", "weight")), type = "response")
    pr <- prediction(p, LogisticRegression$gender)
    # ROC curve: true positive rate against false positive rate
    prf <- performance(pr, measure = "tpr", x.measure = "fpr")
    plot(prf)
    # Area under the ROC curve
    auc <- performance(pr, measure = "auc")
    auc <- auc@y.values[[1]]
    auc
    [1] 0.9325298
The ROC plot shows the True Positive Rate plotted against the False Positive Rate. The ROC curve is used to obtain the AUC (Area Under the Curve), which estimates the performance of a binary classifier. As a rule of thumb, a model with good predictive ability has an AUC closer to 1 than to 0.5. Here the AUC is 0.93, which is close to 1 and indicates that the model discriminates well between the two classes.
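Returning to the coefficient table and deviance figures reported for the logistic model, a short sketch can turn them into quantities that are easier to read. It uses only the numbers quoted in the text: exponentiating a log-odds coefficient gives an odds ratio, McFadden's R² follows from the two deviances (assuming, as holds for ungrouped binary data, that the deviance equals -2 times the log-likelihood), and the AIC is the residual deviance plus twice the number of estimated parameters.

```r
# Reading the reported logistic regression output (values taken from the text).
coefs <- c(chol = -0.002041, age = 0.037806, height = 0.753908, weight = -0.009551)

# Odds ratios: multiplicative change in the odds of gender = 1 (male)
# for a one-unit increase in each predictor, other predictors held constant
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# McFadden's pseudo R squared from the null and residual deviance
null_dev  <- 507.07
resid_dev <- 259.62
mcfadden  <- 1 - resid_dev / null_dev
print(mcfadden)

# AIC = residual deviance + 2 * number of estimated parameters
# (intercept plus four slopes gives 5 parameters)
n_params <- length(coefs) + 1
aic <- resid_dev + 2 * n_params
print(aic)
```

An odds ratio above 1 (age, height) means the odds of the outcome coded 1 rise with the predictor, and one below 1 (chol, weight) means they fall. The McFadden and AIC values recovered here agree with the pR2() and summary() output shown above.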
  • 18. References
1. Pallant, J. SPSS Survival Manual.
2. https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
3. https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/