Project ppt

R PROGRAMMING PROJECT
LOGISTIC REGRESSION ANALYSIS
TO DETECT THE PRESENCE OF
HEART DISEASES

CONTRIBUTION
Kritika Jain and Vanya Vasudeva: coding,
presentation and research
Vansh Puri and Anant Goyal: data
cleaning

Introduction
Heart disease is one of the leading causes of death, accounting for 17.7 million deaths each year, or
31% of all global deaths, as reported by the World Health Organization in 2017.
Several clinical measurements and symptoms are known to be related to heart disease, including age,
blood pressure, total cholesterol, diabetes and hypertension.
The Heart Disease dataset consists of these attributes, collected and summarized from
patients.
With the large amounts of data made available in recent years, statistical methods can be used to
predict the likelihood of heart disease for each patient and so support diagnosis.

Aim of analysis
 The aim of analyzing this data set is to predict which people suffer from heart
disease based on certain common factors.
 Since the prediction has to be in a yes or no format, the dependent variable is
categorical and can therefore be modelled using logistic regression or a random forest.

EXPLORATORY DATA
ANALYSIS
 Exploratory Data Analysis refers to the critical process of performing initial investigations
on data so as to discover patterns, spot anomalies, test hypotheses and check
assumptions with the help of summary statistics and graphical representations. It employs
a variety of techniques to:
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings

Components of EDA
 The main components of exploring data are:
1. Understanding your variables
 Importing the dataset
 Changing the structure
 Summary statistics: calculating mean, median, mode, skewness, kurtosis, variance, etc.
2. Cleaning your dataset
 Removing redundant variables and NA values
 Variable selection
 Removing outliers: using boxplots
3. Analysing the variables
 Visualising data: histograms, bar plots, scatterplots, etc.
 Checking for correlation: using corrplots
4. Creating a model
 Using logistic regression to make predictions
 Checking the accuracy of the model

What is logistic regression analysis?
Logistic regression is the appropriate regression
analysis to conduct when the dependent variable is
dichotomous (binary), i.e. categorical with two levels. Like all
regression analyses, logistic regression is a
predictive analysis. It is used to
describe data and to explain the relationship
between one binary dependent variable and one or
more nominal, ordinal, interval or ratio-level
independent variables.

Major assumptions in binary
logistic regression
 The dependent variable should be dichotomous in nature (e.g., presence vs. absence).
 There should be no extreme outliers in the data.
 The applicable regression assumptions should be satisfied: independent observations, no strong
multicollinearity among the predictors, and a linear relationship between the continuous
predictors and the log odds.
 At the center of logistic regression analysis is the task of estimating the log odds of an
event. Mathematically, logistic regression estimates a multiple linear regression function
defined as:
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$$
for i = 1…n, where p_i is the probability of the event for observation i.
Each regression coefficient gives the change in the log odds of the response for a unit change in
the predictor. Since the relationship between p(X) and X is not a straight line, a unit change in
a predictor does not change the predicted probability by a fixed amount; instead it multiplies
the odds by exp(coefficient).
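The link between coefficients, log odds and odds ratios can be illustrated with a small base R sketch. The coefficient values here are hypothetical and chosen only for illustration; they are not taken from the models in this project.

```r
# Hypothetical coefficients for illustration only (not from the fitted models).
beta0 <- -1.0   # intercept on the log-odds scale
beta1 <- 0.7    # change in log odds per unit increase in x

x <- c(0, 1, 2)
log_odds <- beta0 + beta1 * x   # linear predictor
p <- plogis(log_odds)           # inverse logit: 1 / (1 + exp(-log_odds))
odds <- p / (1 - p)

# Each unit step in x multiplies the odds by the same factor, exp(beta1),
# even though the change in p differs from step to step.
odds_ratio <- odds[-1] / odds[-length(odds)]

print(round(p, 3))
print(round(odds_ratio, 3))
print(round(exp(beta1), 3))
```

The probabilities move by unequal amounts between x = 0, 1 and 2, while the ratio of successive odds stays constant at exp(beta1).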

Understanding the variables
 age - age in years
 sex - (1 = male; 0 = female)
 cp - chest pain type
0: Typical angina: chest pain related to decreased blood supply to the heart
1: Atypical angina: chest pain not related to the heart
2: Non-anginal pain: typically esophageal spasms; not heart related
3: Asymptomatic: chest pain not showing signs of disease
 trestbps - resting blood pressure (in mm Hg on admission to the hospital)
above 130-140 is a cause for concern
 chol - serum cholesterol in mg/dl
above 200 is a cause for concern

 fbs - fasting blood sugar > 120 mg/dl
(1 = true; 0 = false)
above 126 mg/dl signals diabetes
 restecg - resting electrocardiographic results
0: Nothing to note
1: can range from mild symptoms to severe problems;
signals a non-normal heartbeat
2: Enlarged main pumping chamber of the heart
 thalach - maximum heart rate achieved
 exang - exercise induced angina
(1 = yes; 0 = no)
 oldpeak - ST depression induced by exercise; reflects the stress of the heart during exercise
an unhealthy heart stresses more

 slope - the slope of the peak exercise ST segment
0: Upsloping: better heart rate with exercise
1: Flat: typical healthy heart
2: Downsloping: signs of an unhealthy heart
 ca - number of major vessels (0-3) colored by fluoroscopy
- a colored vessel means the doctor can see the blood passing through
- the more blood movement, the better the heart is functioning (no clots)
 thal - thallium stress test result
0,1: normal
2: fixed defect
3: reversible defect: no proper blood movement when exercising
 target - has heart disease or not
(1 = yes, 0 = no)
- the predicted attribute

Changes made to variables
 For simplification, we changed the variable names.
 cp = chest_pain_type
 trestbps = rest_bp
 chol = cholesterol
 fbs = fast_bs
 restecg = rest_ecg
 thalach = max_hr
 exang = ex_induced_angina
 oldpeak = ST_dep
 ca = vessels
 thal = defect

Other changes to data
 Column age has been converted into an integer.
 Neither NA nor missing values were found in the data.
 The following variables were converted into factors so that they are treated as
categories rather than numbers:
 Slope
 Vessels
 Defect
 Target
 Sex
 Chest_pain_type
 Fast_bs
 Rest_ecg
 Ex_induced_angina
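The renaming and type changes described above could be carried out along the following lines in base R. This is a minimal sketch: the four-row data frame is an invented stand-in for the real dataset, and the slides do not show the code that was actually used.

```r
# Invented stand-in for the heart data (values are for illustration only).
heart <- data.frame(
  age = c(63, 37, 41, 56), sex = c(1, 1, 0, 1), cp = c(3, 2, 1, 1),
  trestbps = c(145, 130, 130, 120), chol = c(233, 250, 204, 236),
  fbs = c(1, 0, 0, 0), restecg = c(0, 1, 0, 1),
  thalach = c(150, 187, 172, 178), exang = c(0, 0, 0, 1),
  oldpeak = c(2.3, 3.5, 1.4, 0.8), slope = c(0, 0, 2, 2),
  ca = c(0, 0, 1, 0), thal = c(1, 2, 2, 3), target = c(1, 1, 1, 0)
)

# New name on the left, original column name on the right.
new_names <- c(chest_pain_type = "cp", rest_bp = "trestbps", cholesterol = "chol",
               fast_bs = "fbs", rest_ecg = "restecg", max_hr = "thalach",
               ex_induced_angina = "exang", ST_dep = "oldpeak",
               vessels = "ca", defect = "thal")
names(heart)[match(new_names, names(heart))] <- names(new_names)

# Age becomes an integer; the coded variables become factors.
heart$age <- as.integer(heart$age)
factor_cols <- c("slope", "vessels", "defect", "target", "sex",
                 "chest_pain_type", "fast_bs", "rest_ecg", "ex_induced_angina")
heart[factor_cols] <- lapply(heart[factor_cols], factor)

# Confirm that there are no missing values, then inspect the structure.
stopifnot(!anyNA(heart))
str(heart)
```

Converting the coded columns to factors matters for the later models, because glm() would otherwise treat codes such as 0, 1, 2, 3 as a numeric scale.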

Summary statistics of data
 Summary statistics give a quick description of the data, including the mean, median, mode,
skewness, kurtosis, variance and standard deviation of the variables. Some of these measures
apply only to variables that are numeric in nature.
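As a rough sketch, these measures can be computed in base R as follows. The vector of cholesterol-like values is invented, and skewness and kurtosis are computed from moments because base R has no built-in functions for them; the slides do not say which package was used.

```r
# Invented cholesterol-like values for illustration.
x <- c(233, 250, 204, 236, 354, 192, 294, 263, 199, 168)

print(summary(x))   # min, quartiles, median, mean, max
print(var(x))       # sample variance (divides by n - 1)
print(sd(x))        # sample standard deviation

# Moment-based skewness and kurtosis (these divide by n).
m  <- mean(x)
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)
skewness <- m3 / m2^1.5
kurtosis <- m4 / m2^2   # about 3 for a normal distribution

# Base R has no statistical mode function; this helper returns the most
# frequent value (the first one found if there are ties).
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}

print(c(skewness = skewness, kurtosis = kurtosis))
print(stat_mode(c(1, 2, 2, 3)))
```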

Removing outliers
To detect outliers, we use boxplots.
We plot each numeric variable on a boxplot and then replace the outliers with the mean, median or mode.
Maximum heart rate
(before and after)

cholesterol
(before and after)

resting blood pressure
(before and after)

ST depression
(before and after)
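The before-and-after treatment shown in these plots can be sketched with boxplot.stats(), which returns the points that a boxplot would draw beyond its whiskers. The sketch below assumes replacement by the median of the remaining values; the slides mention mean, median or mode without saying which was used for each variable, and the data are invented.

```r
# Replace boxplot outliers (points beyond the 1.5 * IQR whiskers) with the
# median of the non-outlying values. Using the median is an assumption.
replace_outliers <- function(x) {
  out <- boxplot.stats(x)$out
  is_out <- x %in% out
  x[is_out] <- median(x[!is_out])
  x
}

# Invented cholesterol-like values; 564 is an obvious outlier.
chol <- c(204, 212, 220, 233, 236, 240, 250, 263, 564)
cleaned <- replace_outliers(chol)

print(boxplot.stats(chol)$out)      # outliers before
print(boxplot.stats(cleaned)$out)   # outliers after
print(cleaned)
```

Replacing rather than deleting outliers keeps the number of rows unchanged, which matters when the same rows are used across several variables.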

Data visualizations
 Data visualization refers to plotting the data in histograms, bar graphs, scatterplots, etc. for easier
and more efficient understanding of the data.
 A bar chart or bar graph presents categorical data with rectangular bars
with heights or lengths proportional to the values that they represent.
 Histogram: a graphical display of data using bars of different heights. It is similar to a bar chart,
but a histogram groups numbers into ranges. The height of each bar shows how many values fall into
each range.
 A scatterplot is a type of data display that shows the relationship between two numerical
variables. Each member of the dataset gets plotted as a point whose (x, y)
coordinates relate to its values for the two variables.

Histograms
In order to make the histograms, we filtered the data using the dplyr package to get the
data of people suffering from heart disease. We then used the hist() function to create
the histograms.
It shows the number of people of a particular age suffering from
heart disease.
As you age, so do your blood vessels. They become less flexible,
making it harder for blood to move through them easily.
We observe that most of the heart disease patients in our data are in the age group of 40-65.

 It shows the frequency of ST depression values among
patients suffering from heart disease.
 ST depression shows the stress of the heart induced by
exercise.
 Lower the ST depression, greater is the stress of the heart.
 Heart patients experience relatively higher stress of the
heart.

It shows that most people with
heart disease have a resting bp between 120 and 140 mm Hg.
Very high and very low blood pressure are both causes of
concern for heart patients.
The graph shows that heart patients tend to have a high bp.
A normal blood pressure reading is below about 120/80 mm Hg (systolic/diastolic).

 It shows that most people suffering
from heart disease have cholesterol levels between 200 and
240 mg/dl.
 According to research, cholesterol levels above 200 mg/dl are a
cause for concern.

 Heart rate is the speed of the heartbeat measured by the
number of contractions (beats) of the heart per minute
(bpm).
 The heart rate can vary according to the body's physical
needs, including the need to absorb oxygen.
 It shows that heart disease patients generally have a
maximum heart rate of 160-180 bpm.
 Research shows that the maximum heart rate should neither fall too
low nor rise too high.

Scatterplot
The scatterplot shows a negative relationship
between age and maximum heart rate achieved.
As age increases, the maximum heart rate achieved falls
for a diseased person.

Mixed Corrplot
 There should be no high correlations (multicollinearity)
among the predictors. This can be assessed by a
correlation matrix among the predictors.
 We created a data set of numeric variables and
calculated the correlation amongst the variables.
 We further used the code of a mixed corrplot to
visualize our calculations.
Since every pairwise correlation between the variables is below 0.90,
these variables can be used in the model.
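The correlation check can be sketched in base R as below. The data are simulated stand-ins for the numeric columns (the column names follow the renamed variables, the values and relationships are invented), and the plot itself is only indicated in a comment because it needs the corrplot package.

```r
set.seed(1)
n <- 200

# Simulated numeric columns; the relationships are invented for illustration.
age <- round(rnorm(n, mean = 54, sd = 9))
num_vars <- data.frame(
  age         = age,
  rest_bp     = 100 + 0.5 * age + rnorm(n, sd = 15),
  cholesterol = rnorm(n, mean = 245, sd = 50),
  max_hr      = 200 - 0.9 * age + rnorm(n, sd = 18),
  ST_dep      = abs(rnorm(n, mean = 1, sd = 1))
)

corr_matrix <- cor(num_vars)
print(round(corr_matrix, 2))

# Largest absolute correlation between two different variables.
max_abs_corr <- max(abs(corr_matrix[upper.tri(corr_matrix)]))
print(max_abs_corr < 0.90)

# With the corrplot package installed, the matrix could be drawn with
# corrplot::corrplot.mixed(corr_matrix)
```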

Corrplot
INSIGHTS:
 As age increases, cholesterol, stress of heart during exercise and resting bp
also increase. On the other hand, maximum heart rate falls with old age.
 As cholesterol increases, stress of heart during exercise and resting bp
increase, while maximum heart rate falls.
 As ST depression rises, i.e. stress of the heart falls, resting bp rises.
 Resting bp also has a negative relation with maximum heart rate.
 The correlations between all variables are weak. However, age
and maximum heart rate show a slightly higher correlation.
 ST depression and maximum heart rate also show similar results.

Bar chart
We observe that the number of males is
higher than the number of females in our dataset.
We thus assumed that the data is biased.
However, we later concluded that a larger proportion of
males going for heart check-ups may indicate a
higher degree of risk for males.

Creating a model
 We use LOGISTIC REGRESSION because the dependent variable, in our case,
target (whether a person suffers from heart disease or not) is categorical in nature.
 1: person suffers from heart disease
 0: person does not suffer from heart disease

Division into train data and test data
 We use the caret package to divide the entire dataset into 2 parts:
 Train: this part of the dataset is used for model building. Analysis is done on this
dataset and an appropriate model is built according to requirements.
 Test: this part of the dataset is used to test the model. The model's predictions on this dataset
are compared with the actual outcomes, and the accuracy of the model can be
estimated.
The function createDataPartition() is used to split the data into 60% for the training set
and 40% for the testing set.
We do this to predict the dependent variable target.
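A 60/40 split that keeps the class proportions of target can be sketched in base R as below. This is only an approximation of what caret's createDataPartition() does, written without the package so that it runs anywhere; the data frame is an invented stand-in with 100 rows.

```r
set.seed(42)

# Invented stand-in: 45 patients without and 55 with heart disease.
heart <- data.frame(id = 1:100,
                    target = factor(rep(c(0, 1), times = c(45, 55))))

# Sample 60% of the row numbers within each class of target.
rows_by_class <- split(seq_len(nrow(heart)), heart$target)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.6 * length(rows)))]
}), use.names = FALSE)

train <- heart[train_idx, ]
test  <- heart[-train_idx, ]

print(table(train$target))
print(table(test$target))

# With caret installed, a comparable index could be obtained with
# caret::createDataPartition(heart$target, p = 0.6, list = FALSE)
```

Sampling within each class keeps the share of diseased patients about the same in both parts, so the test accuracy is not distorted by an unbalanced split.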

Variable selection
 Method 1: BACKWARD SELECTION
Backward selection (or backward elimination) starts with all predictors in the model,
iteratively removes the least contributive predictors, and stops when all remaining
predictors are statistically significant.
Using backward selection, the following variables are found to be statistically significant in our
analysis:
• Sex
• Chest pain type
• Rest bp
• Ex induced angina
• ST depression
• Slope
• Vessels
• Defect
Hence, we build a model from the variables selected through backward elimination.
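Backward elimination on a logistic model can be sketched with base R's step() function. Two caveats: step() drops terms by AIC rather than by p-value, and the slides do not say which function was actually used. The data below are simulated, with one predictor (noise) deliberately unrelated to the outcome.

```r
set.seed(7)
n <- 300

# Simulated data; only sex and ST_dep truly influence the outcome here.
d <- data.frame(
  sex     = factor(rbinom(n, 1, 0.68)),
  ST_dep  = abs(rnorm(n, mean = 1, sd = 1)),
  rest_bp = rnorm(n, mean = 130, sd = 17),
  noise   = rnorm(n)
)
log_odds <- -0.5 + 1.2 * (d$sex == "1") - 0.9 * d$ST_dep
d$target <- rbinom(n, 1, plogis(log_odds))

# Start from the full model and let step() remove terms while AIC improves.
full_model <- glm(target ~ ., data = d, family = binomial)
reduced_model <- step(full_model, direction = "backward", trace = 0)

print(formula(reduced_model))
print(c(full = AIC(full_model), reduced = AIC(reduced_model)))
```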

Result of Backward Elimination
Model1:

Checking for multicollinearity
 The variables should not be correlated amongst themselves in
order to have better accuracy.
 We check the variance inflation factor (VIF) in order to check for
multicollinearity.
 If the VIF of a variable is greater than 5, it indicates possible
multicollinearity. Therefore we remove any variable that has a
VIF > 5.
 However, VIF can only be checked for numeric data.

Since none of the variables has a VIF > 5, we use all
variables.
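For numeric predictors, the VIF of a variable equals 1 / (1 - R²), where R² comes from regressing that variable on all the other predictors. The helper below is a hypothetical base R sketch of that definition on simulated data; the slides do not say which function was used (car::vif() is a common choice).

```r
# VIF of each column: regress it on the remaining columns and
# apply 1 / (1 - R^2). vif_manual is a hypothetical helper name.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(10)
n <- 200
x1 <- rnorm(n)
predictors <- data.frame(
  x1 = x1,
  x2 = rnorm(n),                  # unrelated to the others
  x3 = x1 + rnorm(n, sd = 0.1)    # almost a copy of x1
)

vif_values <- vif_manual(predictors)
print(round(vif_values, 2))
print(names(vif_values)[vif_values > 5])   # candidates for removal
```

In this simulated example x1 and x3 exceed the threshold of 5 because they carry nearly the same information, while x2 stays close to 1.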

 Method2: random forest
Random forest, as its name implies, consists of a large number of individual decision
trees that operate as an ensemble. Each individual tree in the random forest outputs a
class prediction and the class with the most votes becomes our model's prediction.

VARIABLE IMPORTANCE PLOT
 We applied random forest on our entire dataset
in order to get the variable importance plot.
 The variable importance plot shows the mean
decrease in accuracy, which represents how
much the accuracy of the model falls when the
values of a variable are randomly permuted.
 The higher the mean decrease in accuracy or
mean decrease in Gini score, the higher the importance
of the variable in the model.
 Having the lowest mean decrease in accuracy, resting
bp, fasting blood sugar and resting ecg were
eliminated from our model.

Model 2:
 With the help of mean decrease accuracy, we build a
logistic model using:
 age, sex, chest pain type,cholesterol, maximum heart rate,
exercise induced angina, ST depression, slope, vessels and
defect.

Accuracy and ROC curve
We compare the actual and predicted values of both models and the results are as follows:
Model1:
Accuracy= 85.95%
Area under the curve(AUC)= 87.41%
Model2:
Accuracy= 79.33%
Area under the curve(AUC)= 86.52%

Result
 Accuracy = correct predictions / total predictions
 The ROC curve is summarized by the area under the curve (AUC). The greater the
area, the more reliable the model.
 Since the accuracy and area under the curve of
Model1 are higher, it is the better fit for analysis.
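Both measures can be computed from a vector of actual classes and a vector of predicted probabilities. The base R sketch below uses four invented observations; the AUC is obtained from the rank-sum formula, which is an alternative to the ROCR package that the slides appear to have used, and the 0.5 cut-off is an assumption.

```r
# AUC via the rank-sum (Mann-Whitney) formula: the probability that a
# randomly chosen positive case gets a higher score than a negative one.
auc_rank <- function(actual, score) {
  r <- rank(score)   # average ranks handle ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented example: actual classes and predicted probabilities.
actual <- c(0, 0, 1, 1)
score  <- c(0.10, 0.40, 0.35, 0.80)

predicted <- as.integer(score > 0.5)   # assumed cut-off of 0.5
accuracy  <- mean(predicted == actual)
auc       <- auc_rank(actual, score)

print(c(accuracy = accuracy, auc = auc))

# With ROCR installed, the AUC could instead be obtained with
# ROCR::performance(ROCR::prediction(score, actual), "auc")
```

Accuracy depends on the chosen cut-off, while AUC summarizes how well the scores rank diseased above non-diseased patients across all cut-offs.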

Confusion matrix
 Accuracy: the proportion of the total number of predictions that were correct.
 Positive Predictive Value or Precision: the proportion of predicted positive cases that are
actually positive.
 Negative Predictive Value: the proportion of predicted negative cases that are actually negative.
 Sensitivity or Recall or True Positive Rate: the proportion of actual positive cases which are
correctly identified.
 Specificity: the proportion of actual negative cases which are correctly identified.
 Precision: d/(c+d)

Accuracy of our model = (43+61)/(43+61+5+12) = 85.95%
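The measures defined above can be read off a 2x2 table built with table(). The sketch below uses ten invented predictions to show all five measures, then reproduces the accuracy arithmetic from the counts on the slide. Only the split into correct (43 and 61) and incorrect (5 and 12) predictions is taken from the slide; it does not say which count is which cell.

```r
# Invented actual and predicted classes for illustration.
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0), levels = c(0, 1))

conf_matrix <- table(Predicted = predicted, Actual = actual)
print(conf_matrix)

tp <- conf_matrix["1", "1"]   # predicted 1, actually 1
tn <- conf_matrix["0", "0"]   # predicted 0, actually 0
fp <- conf_matrix["1", "0"]   # predicted 1, actually 0
fn <- conf_matrix["0", "1"]   # predicted 0, actually 1

metrics <- c(
  accuracy    = (tp + tn) / sum(conf_matrix),
  precision   = tp / (tp + fp),   # positive predictive value
  npv         = tn / (tn + fn),   # negative predictive value
  sensitivity = tp / (tp + fn),   # recall, true positive rate
  specificity = tn / (tn + fp)
)
print(round(metrics, 3))

# Accuracy from the counts on the slide: correct / total predictions.
slide_accuracy <- (43 + 61) / (43 + 61 + 5 + 12)
print(round(100 * slide_accuracy, 2))
```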

Testing the significance of regressors
 Null Hypothesis (H0): the explanatory variable does not affect the dependent variable.
 Alternative Hypothesis (Ha): the explanatory variable affects the dependent variable.
 We check the summary of the model built with the help of backward elimination in
order to check the p-value of each variable.
 If p < 0.05 (5% level of significance), the null hypothesis is rejected.
 The p-values of sex, chest pain type, ST depression and vessels are less than 0.05.
 Therefore, these variables are included in our final model.
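The p-values come from the coefficient table of summary() applied to the fitted glm object. A minimal sketch on simulated data follows; the predictor names x_signal and x_noise are hypothetical, and only x_signal is built to influence the outcome.

```r
set.seed(3)
n <- 400

# Simulated predictors: only x_signal affects the outcome.
x_signal <- rnorm(n)
x_noise  <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x_signal))

fit <- glm(y ~ x_signal + x_noise, family = binomial)

# The coefficient table holds estimates, standard errors, z values and p-values.
coef_table <- summary(fit)$coefficients
p_values <- coef_table[, "Pr(>|z|)"]
print(round(coef_table, 4))

# Variables for which the null hypothesis is rejected at the 5% level.
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
print(significant)
```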

Model3:
The variables used in our model are also among the highest rated variables in
the variable importance plot from random forest. Therefore, with the help of
the previous 2 models, we build our final model with the variables sex, chest
pain type, ST depression and vessels.
All 4 variables are highly significant according to their p-values.

Checking heteroscedasticity
We plot the fitted values against the residuals of
the logistic model.
Since the scatterplot is not funnel shaped, this
model is free from heteroscedasticity.

Checking accuracy
Accuracy of the model =
(44+64)/(44+64+2+11) = 89.26%
Area under the curve = 91.55%

Conclusion
 This dataset is old and small by today's standards. However, it has allowed us
to create a simple model and use machine learning explainability tools and
techniques to peek inside.
 At the start, we hypothesised, based on prior research, that factors such as
cholesterol, age and fasting blood sugar would be major factors in the model.
However, this dataset didn't show that. Instead, the number of major vessels
and aspects of ECG results dominated.

Insights
 Sex: we observe that a larger number of males appear in heart
check-ups. This suggests that males are more prone to heart disease.
 Chest pain type: chest pain type plays a major role in predicting heart disease, as
patients suffering from angina pains have a higher probability of suffering from
heart disease. Typical angina pain is a major indicator of heart disease.
 ST depression: the stress of the heart during exercise causes higher risk of heart
disease. Thus the higher the ST depression, the lower the risk.
 Vessels: vessels refers to the fluoroscopy test done as an indicator of blood flow
through the vessels. The darker the color in fluoroscopy, the lower the risk of heart
disease.
