Regression and Classification Analysis

National College of Ireland
Project Submission Sheet – 2018/2019
Student Name: Yash Balaji Iyengar
Student ID: X18124739
Programme: MSc Data Analytics Cohort B
Year: 2019-2020
Module: Statistics in Data Analytics
Lecturer: Tony Delaney
Submission Due Date: 7th April 2019
Project Title: Statistics Continuous Assessment 2

Word Count: 1718
I hereby certify that the information contained in this (my submission) is information
pertaining to research I conducted for this project. All information other than my own
contribution will be fully referenced and listed in the relevant bibliography section at the
rear of the project.
ALL internet material must be referenced in the references section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
other author's written or electronic work is illegal (plagiarism) and may result in
disciplinary action. Students may be required to undergo a viva (oral examination) if
there is suspicion about the validity of their submitted work.
Signature:
Date: 7/04/2019
PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. Projects should be submitted to your Programme Coordinator.
3. You must ensure that you retain a HARD COPY of ALL projects, both for your own reference
and in case a project is lost or mislaid. It is not sufficient to keep a copy on computer. Please
do not bind projects or place in covers unless specifically requested.
4. You must ensure that all projects are submitted to your Programme Coordinator on or before
the required submission date. Late submissions will incur penalties.
5. All projects must be submitted and passed in order to successfully complete the year. Any
project/assignment not submitted will be marked as a fail.

Office Use Only
Signature:
Date:
Penalty Applied (if applicable):
MULTIPLE LINEAR REGRESSION
Introduction: In statistical analysis, regression is a set of statistical processes used to
estimate the relationship between variables. Multiple regression predicts the value of one
variable from two or more other variables. The variable we predict is called the dependent
variable, and the variables used to predict it are called the independent variables.
Data Description:
Datasets are downloaded from the following links:
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHS2_3070_cancer
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000011
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aTOBACCO_0000000344
Three datasets have been merged to form one single dataset. The dataset consists of
• Number of Deaths due to Cancer
• Alcohol Consumption among adults (in liters)
• Current users of any tobacco products (rate of users)
Here my dependent variable is "Number of Deaths due to Cancer" and my independent
variables are "Alcohol Consumption among adults" and "Current users of any tobacco
products". The model will try to predict the number of deaths due to cancer from the
independent variables. We have taken a total of 152 samples for this model.
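The merge step described above might look like the following minimal sketch in Python with pandas. The join key ("Country"), the column names, and every value below are hypothetical placeholders; the report does not say how the three files were combined.

```python
import pandas as pd

# Hypothetical miniature versions of the three UN Data extracts.
# Column names and values are placeholders, not the real data.
cancer = pd.DataFrame({"Country": ["A", "B", "C"], "Deaths": [120.0, 150.0, 98.0]})
alcohol = pd.DataFrame({"Country": ["A", "B", "D"], "Alco_Con": [2.5, 9.1, 4.0]})
tobacco = pd.DataFrame({"Country": ["A", "B", "C"], "Tobacco_Con": [18.0, 31.5, 22.0]})

# Inner joins keep only the countries that appear in all three tables.
merged = (
    cancer.merge(alcohol, on="Country", how="inner")
          .merge(tobacco, on="Country", how="inner")
)
print(merged)
```

An inner join is assumed here because a regression needs a value for every variable in each row; countries missing from any one file are dropped.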
Assumptions:
1) Our dependent variable "Number of Deaths due to Cancer" is a continuous variable.
It is a count of deaths and can be measured on a continuous scale.
2) There are two independent variables, "Alcohol Consumption among adults" and
"Current users of any tobacco products"; both are continuous in nature.
3) Here we check for auto-correlation between the observations.

The Durbin-Watson test checks for auto-correlation between the observations. If the value is
between 1.5 and 2.5, we can conclude that the observations are not auto-correlated. Our
value is 1.993, so we can conclude that our observations are independent. The R square
value gives the proportion of variance in the dependent variable that the model explains;
here it is about 12%.
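For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below is a plain Python illustration of that formula and of the 1.5 to 2.5 rule of thumb used above. The residuals in the example are made up, since the report gives only the final value of 1.993.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def no_autocorrelation(dw, low=1.5, high=2.5):
    """Rule of thumb from the text: values between 1.5 and 2.5 are acceptable."""
    return low < dw < high


# Made-up residuals, for illustration only.
example_residuals = [0.5, -0.3, 0.2, -0.4, 0.1]
print(durbin_watson(example_residuals))

# The value reported by SPSS in this analysis.
print(no_autocorrelation(1.993))
```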
4) We will now check for linearity between dependent and independent variables.
From the above figure, we can see that the relationship between the dependent and
independent variables is approximately linear and there are no outliers.
5) Let's now check for homoscedasticity.

From the scatterplot we can see that the points follow no specific pattern, apart from
a couple of outliers, so we can say that the variance of the data remains similar along
the best fit line. This means our data is homoscedastic in nature.
6) Let us now check for multi-collinearity between the independent variables.
We can see from the above table that the dependent variable Deaths is correlated with
Alcohol Consumption, as the Pearson correlation value is 0.353, which is above 0.3. The
correlation of Tobacco Consumption with Deaths is low, as the Pearson correlation value
is 0.060, which is less than 0.3.

We can see that multicollinearity does not exist between the two independent variables
Tobacco Consumption and Alcohol Consumption as the Pearson Correlation value is
0.169 which is less than 0.70.
7) Let’s check our data for Normality
From the above histogram, we can observe that the data is normally distributed except
for one outlier.
SPSS Output Interpretation:

From the above table, we can observe that there are 152 samples. Means and standard
deviations for all the variables are calculated.
• From the above table we can check the significance value of each independent
variable. A significance value below 0.05 indicates that the variable has a
statistically significant effect on the dependent variable.
• Alcohol consumption has a significant impact on the number of Deaths, whereas
Tobacco consumption has almost no impact on the deaths.
• Also, the Unstandardized Coefficient column gives the intercept and slopes of the
best fit line. From these values we can write the regression equation.
• The Tolerance value indicates collinearity and should be above 0.1 to avoid
multi-collinearity; our value is 0.971. The VIF value should be less than 10; our
value is 1.029.
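With only two predictors, the Tolerance and VIF values follow directly from the Pearson correlation between them: tolerance is 1 minus the squared correlation, and VIF is its reciprocal. The short Python check below uses the correlation of 0.169 reported earlier; the variable names are illustrative.

```python
# Pearson correlation between Tobacco Consumption and Alcohol Consumption,
# as quoted in the multi-collinearity check.
r_predictors = 0.169

# With two predictors, R squared of one predictor on the other is r squared.
tolerance = 1 - r_predictors ** 2
vif = 1 / tolerance

print(f"Tolerance = {tolerance:.3f}")
print(f"VIF = {vif:.3f}")
```

Rounded to three decimals, these agree with the SPSS values of 0.971 and 1.029.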
From the above table we can interpret the following information:
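• The Sum of Squares column shows that the regression accounts for 27481.190 of
the total sum of squares of 220223.520, which is about 12% of the variation. The
significance value is reported as 0, which is below 0.05, so the model as a whole
is significant.
• Also, the regression uses 2 of the 151 total degrees of freedom.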
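The figures in the ANOVA table can be tied back to the R square value with a few lines of arithmetic. The Python sketch below assumes that 2 is the regression degrees of freedom and 151 the total, which is the usual layout for 152 samples and two predictors; the residual values are derived by subtraction and are not quoted in the report.

```python
# Values quoted from the ANOVA table in the text.
ss_regression = 27481.190
ss_total = 220223.520
df_regression = 2
df_total = 151

# Residual quantities follow by subtraction (not quoted in the report).
ss_residual = ss_total - ss_regression
df_residual = df_total - df_regression

# Proportion of variance explained, and the F ratio of the mean squares.
r_squared = ss_regression / ss_total
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

print(f"R squared = {r_squared:.3f}")
print(f"F({df_regression}, {df_residual}) = {f_statistic:.2f}")
```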

Result:
As the analysis has been conducted, we have obtained the regression equation as follows:
Deaths = 130.646 + 0(Tobacco_Con) + 3.388(Alco_Con)
Since the coefficient for Tobacco_Con is 0, it does not contribute to predicting the
number of deaths. So, the equation becomes:
Deaths = 130.646 + 3.388(Alco_Con)
The coefficient of an independent variable is the amount of change in the dependent
variable for a one-unit change in that independent variable. Multiple regression analysis
thus shows what effect the independent variables have on predicting the dependent variable.
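As a small illustration, the fitted equation can be wrapped in a Python function to produce predictions. The function name and the example consumption values are hypothetical; only the two coefficients come from the analysis.

```python
def predict_deaths(alco_con, intercept=130.646, slope=3.388):
    """Predicted Deaths for a given Alcohol Consumption value (in liters)."""
    return intercept + slope * alco_con


# Hypothetical consumption values, for illustration only.
for liters in (0.0, 5.0, 10.0):
    print(liters, predict_deaths(liters))
```

Each additional liter of alcohol consumption raises the predicted value by the slope, 3.388.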

Binary Logistic Regression
Introduction:
Logistic Regression is used to predict a dichotomous dependent variable with the help of
one or more continuous or categorical independent variables.
Data Description:
Datasets are downloaded from the following links:
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHS2_3070_cancer
http://data.un.org/Data.aspx?d=WHO&f=MEASURE_CODE%3aWHOSIS_000011
The data consists of two columns
• Deaths (Factor of Yes/No)
• Alcohol Consumption (Liters)
The "Deaths" column is the dichotomous dependent variable; I have coded Yes as 1 and No
as 0. "Alcohol Consumption" is the continuous independent variable. Here the model tries
to predict whether death occurs from Alcohol consumption.
Assumptions:
1) The dependent variable should be dichotomous. “Deaths” is a dichotomous variable.
2) There should be one or more independent variables. Alcohol Consumption is
our independent variable, and it is continuous.
3) The sample size should be large. We have a dataset of 152 samples.
4) Since we have only one independent variable, multi-collinearity cannot occur.
5) Let us check for Outliers in the Data.
Case-wise listing was not produced since there are no outliers in the data.

SPSS Output Interpretation:
From the case processing summary, we can observe that all the samples have been
processed, total number of samples is 152.
Now there are two blocks of outputs. Block zero is the case where SPSS runs the model
without providing independent variables. Let us interpret its results as follows:
In block 0, the classification table shows that the model predicts that deaths do not occur
for all the cases. This happens because no independent variable is provided to the model.
Therefore, the model predicts only 54.6% of the values correctly.
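The block 0 figure can be read as a majority-class baseline: every case is assigned the more common outcome. The Python sketch below assumes 83 "No" and 69 "Yes" cases among the 152 samples. These counts are inferred from the 54.6% figure and are not taken from the SPSS output.

```python
from collections import Counter

# Assumed class counts (inferred from 54.6% of 152, not from the SPSS table).
labels = [0] * 83 + [1] * 69  # 0 = No, 1 = Yes

# The baseline predicts the most common class for every case.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class)
print(f"{baseline_accuracy:.1%}")
```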
In block 1, we see the table Omnibus Tests of Model Coefficients. Here the model is tested
with the predictor variable included. This is a goodness of fit test in which the results
are compared with block zero to check whether the predictor variable has an impact on the
dependent variable. The significance value should be less than 0.05. Our table shows 0.013,
which means the independent variable has an impact on the dependent variable.
The Cox and Snell R square and Nagelkerke R square values both estimate how much of the
variation in the dependent variable is explained by the model. Here they indicate that
between 3.9% and 5.3% of the variation in the dependent variable is explained by the model.
This classification table belongs to block 1 and shows that 61.2% of cases are predicted
correctly, which is better than the block 0 figure of 54.6%. This happens because the
predictor variable has been included in the model.
The Hosmer-Lemeshow test is also used to check for goodness of fit. The significance value
should be greater than 0.05. Our significance value is 0.06, so there is no evidence of
poor fit, although the margin is narrow.

This table tells us how much the independent variable contributes to model prediction.
The significance value should be less than 0.05. The significance value for Alcohol
consumption is 0.015. The B value is the coefficient of the variable; it gives the change in
the log-odds of the dependent variable for a one-unit increase in the independent variable.
Result:
Based on the Logistic regression analysis we get the following regression equation, which
gives the log-odds of death:
ln(p / (1 - p)) = 0.103 - 0.653(Alco_Con)
Here p is the predicted probability of death. Once we substitute the Alcohol Consumption
value into the above equation, we get the log-odds z, which is converted to a probability
with p = 1 / (1 + e^(-z)). If the probability is higher than 0.5, the model predicts that
death occurs, and if the probability is less than 0.5, it predicts that death does not
occur.
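To make the conversion to a probability concrete, the Python sketch below treats the right-hand side of the equation above as the log-odds, which is the usual form of logistic regression output, and applies the logistic function. The function name and the example consumption values are hypothetical; the two coefficients are used exactly as written in the report.

```python
import math


def death_probability(alco_con, intercept=0.103, slope=-0.653):
    """Probability of Deaths = Yes, treating the fitted equation as log-odds."""
    log_odds = intercept + slope * alco_con
    return 1.0 / (1.0 + math.exp(-log_odds))


# Odds ratio for a one-unit increase in Alcohol Consumption.
odds_ratio = math.exp(-0.653)

# Hypothetical consumption values, classified with the 0.5 cut-off.
for liters in (0.0, 1.0, 5.0):
    p = death_probability(liters)
    print(liters, round(p, 3), "Yes" if p > 0.5 else "No")
```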
References
• Pallant, J. SPSS Survival Manual, third edition.
• https://statistics.laerd.com/spss-tutorials/binomial-logistic-regression-using-spss-statistics.php
• https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
