R PROGRAMMING PROJECT
LOGISTIC REGRESSION ANALYSIS
TO DETECT THE PRESENCE OF
HEART DISEASES
CONTRIBUTION
Kritika Jain and Vanya Vasudeva: coding,
presentation and research
Vansh Puri and Anant Goyal: data
cleaning
Introduction
Heart disease is one of the leading causes of death, accounting for 17.7 million deaths each year,
31% of all global deaths, as reported by the World Health Organization in 2017.
Several clinical measurements and symptoms are known to be related to heart disease, including age,
blood pressure, total cholesterol, diabetes and hypertension.
The heart disease dataset consists of the above-mentioned attributes, which were collected and
summarized from patients.
With the large amounts of data made available in recent years, the diagnosis of heart disease can be
automated using traditional statistical methods that predict each patient's likelihood of having
heart disease.
Aim of analysis
• The aim of analyzing this dataset is to predict which people suffer from heart
disease based on certain common factors.
• Since the prediction has to be in a yes-or-no format, the dependent variable is
categorical and can therefore be modelled using logistic regression or a random forest.
EXPLORATORY DATA
ANALYSIS
• Exploratory data analysis (EDA) refers to the critical process of performing initial investigations
on data so as to discover patterns, spot anomalies, test hypotheses and check
assumptions with the help of summary statistics and graphical representations. It employs
a variety of techniques to:
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings
Components of EDA
• The main components of exploring data:
1. Understanding your variables
• Importing the dataset
• Changing the structure
• Summary statistics: calculating mean, median, mode, skewness, kurtosis, variance, etc.
2. Cleaning your dataset
• Removing redundant variables (NA values)
• Variable selection
• Removing outliers: using boxplots
3. Analysing the variables
• Visualising data: histograms, bar plots, scatterplots, etc.
• Checking for correlation: using corrplots
• Creating a model
• Using logistic regression to make predictions
• Checking the accuracy of the model
What is logistic regression analysis?
Logistic regression is the appropriate regression
analysis to conduct when the dependent variable is
dichotomous (binary) or categorical. Like all
regression analyses, logistic regression is a
predictive analysis. It is used to
describe data and to explain the relationship
between one dependent binary variable and one or
more nominal, ordinal, interval or ratio-level
independent variables.
Major assumptions in binary
logistic regression
• The dependent variable should be dichotomous in nature (e.g., present vs. absent).
• There should be no outliers in the data.
• OLS assumptions should be satisfied.
• At the center of logistic regression analysis is the task of estimating the log odds of an
event. Mathematically, logistic regression estimates a multiple linear regression function
defined as:
\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}
for i = 1…n, where p_i is the probability of the event for observation i.
Regression coefficients explain the change in log(odds) of the response for a unit change in a
predictor. However, since the relationship between p(X) and X is not a straight line, a unit change in
an input feature does not change the predicted probability by a fixed amount; instead it multiplies
the odds by a constant factor, the odds ratio.
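The point about odds ratios can be checked numerically. The following is a minimal Python sketch (the project's own R code is not shown in the slides); the intercept `beta0` and slope `beta1` are hypothetical values chosen only for illustration, not estimates from the heart disease data.

```python
import math

def probability(log_odds):
    """Inverse of the logit: convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical intercept and slope, chosen only for illustration.
beta0, beta1 = -4.0, 0.8

for x in range(2, 7):
    p_now = probability(beta0 + beta1 * x)
    p_next = probability(beta0 + beta1 * (x + 1))
    ratio = odds(p_next) / odds(p_now)
    # The change in probability differs from row to row,
    # while the odds ratio stays equal to exp(beta1).
    print(f"x={x}: p={p_now:.3f}, change in p={p_next - p_now:.3f}, odds ratio={ratio:.3f}")
```

Each unit step in x changes the probability by a different amount, but the ratio of the odds is always exp(beta1).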
Understanding the variables
• age - age in years
• sex - (1 = male; 0 = female)
• cp - chest pain type
0: Typical angina: chest pain related to decreased blood supply to the heart
1: Atypical angina: chest pain not related to the heart
2: Non-anginal pain: typically esophageal spasms; not heart related
3: Asymptomatic: chest pain not showing signs of disease
• trestbps - resting blood pressure (in mm Hg on admission to the hospital)
above 130-140 is cause for concern
• chol - serum cholesterol in mg/dl
above 200 is cause for concern
• fbs - (fasting blood sugar > 120 mg/dl)
(1 = true; 0 = false)
above 126 mg/dl signals diabetes
• restecg - resting electrocardiographic results
0: Nothing to note
1: Can range from mild symptoms to severe problems;
signals a non-normal heart beat
2: Enlargement of the heart's main pumping chamber
• thalach - maximum heart rate achieved
• exang - exercise induced angina
(1 = yes; 0 = no)
• oldpeak - ST depression induced by exercise relative to rest; reflects the stress of the heart during exercise
an unhealthy heart stresses more
• slope - the slope of the peak exercise ST segment
0: Upsloping: better heart rate with exercise
1: Flatsloping: typical healthy heart
2: Downsloping: signs of an unhealthy heart
• ca - number of major vessels (0-3) colored by fluoroscopy
- a colored vessel means the doctor can see the blood passing through
- the more blood movement, the better the functioning of the heart (no clots)
• thal - thallium stress result
0,1: normal
2: fixed defect
3: reversible defect: no proper blood movement when exercising
• target - have disease or not
(1 = yes, 0 = no)
- the predicted attribute
Changes made to variables
• For simplification, we changed the variable names:
• cp = chest_pain_type
• trestbps = rest_bp
• chol = cholesterol
• fbs = fast_bs
• restecg = rest_ecg
• thalach = max_hr
• exang = ex_induced_angina
• oldpeak = ST_dep
• ca = vessels
• thal = defect
Other changes to data
• The following variables were converted into factors so that they are treated as
categorical:
• slope
• vessels
• defect
• target
• sex
• chest_pain_type
• fast_bs
• rest_ecg
• ex_induced_angina
• The column age was converted into an integer.
• No NA or missing values were found in the data.
Summary statistics of data
• Summary statistics refers to a quick description of the data, including the mean, median, mode,
skewness, kurtosis, variance and standard deviation of the various variables. Some of these measures
apply only to variables that are numeric in nature.
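As a rough illustration of these measures, here is a minimal Python sketch using only the standard library. The `rest_bp` values are hypothetical, and skewness and kurtosis are computed with the simple population (moment) formulas, which is an assumption; statistical packages may apply different corrections.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0

def summarize(values):
    """Collect the summary statistics named above for one numeric variable."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance
        "sd": statistics.stdev(values),           # sample standard deviation
        "skewness": skewness(values),
        "excess_kurtosis": excess_kurtosis(values),
    }

# Hypothetical resting blood pressure readings (mm Hg), for illustration only.
rest_bp = [110, 120, 120, 130, 140, 150, 170]
summary = summarize(rest_bp)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```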
Removing outliers
To find the outliers, we use boxplots.
We plot the data on a boxplot and then replace the outliers with the mean, median or mode.
[Boxplots, before and after outlier replacement: maximum heart rate, cholesterol, resting blood pressure, ST depression]
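A boxplot usually flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below, in Python, assumes that rule and assumes the median as the replacement value; the slides do not say which replacement was used for which variable, and the cholesterol values are hypothetical.

```python
import statistics

def replace_outliers_with_median(values, k=1.5):
    """Replace points outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [v if low <= v <= high else median for v in values]

# Hypothetical cholesterol readings (mg/dl) with one extreme value.
chol = [199, 204, 212, 226, 233, 240, 250, 564]
cleaned = replace_outliers_with_median(chol)
print(cleaned)
```

Only the extreme reading is replaced; every value inside the fences is left unchanged.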
Data visualizations
• Data visualization refers to plotting the data in histograms, bar graphs, scatterplots, etc. for easier
and more efficient understanding of the data.
• A bar chart or bar graph presents categorical data with rectangular bars
whose heights or lengths are proportional to the values that they represent.
• Histogram: a graphical display of data using bars of different heights. It is similar to a bar chart,
but a histogram groups numbers into ranges. The height of each bar shows how many values fall into
each range.
• A scatterplot is a type of data display that shows the relationship between two numerical
variables. Each member of the dataset is plotted as a point whose (x, y) coordinates
relate to its values for the two variables.
Histograms
In order to make the histograms, we filtered the data using the dplyr package to get the
data of people suffering from heart disease. We then used the hist function to create
the histograms.
The age histogram shows the number of people of each age suffering from
heart disease.
As you age, so do your blood vessels. They become less flexible,
making it harder for blood to move through them easily.
We observe that most heart disease patients in the dataset are in the
40-65 age group.
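The filter-then-bin step described above can be sketched in a few lines of Python. The records and the 10-year bin width are hypothetical choices for illustration; the project itself used dplyr and hist in R.

```python
from collections import Counter

# Hypothetical (age, target) records; target 1 means heart disease.
patients = [(41, 1), (44, 1), (52, 1), (57, 0), (58, 1), (63, 1), (67, 0), (71, 1)]

# Keep only the people with heart disease, as the slides describe.
diseased_ages = [age for age, target in patients if target == 1]

def histogram(values, bin_width=10):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

age_hist = histogram(diseased_ages)
for lower, count in age_hist.items():
    print(f"{lower}-{lower + 9}: {'#' * count}")
```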
• The ST depression histogram shows the frequency of ST depression values among
patients suffering from heart disease.
• ST depression reflects the stress of the heart induced by
exercise.
• The lower the ST depression, the greater the stress of the heart.
• Heart patients experience relatively higher stress of the
heart.
The resting blood pressure histogram shows that most people with
heart disease have a resting bp between 120 and 140 mm Hg.
Both very high and very low blood pressure are causes of
concern for heart patients.
The graph shows that heart patients tend to have a high bp.
The normal range for blood pressure is 80-120 mm Hg.
• The cholesterol histogram shows that most people suffering
from heart disease have cholesterol levels between 200 and
240 mg/dl.
• According to research, a cholesterol level above 200 mg/dl is a
cause for concern.
• Heart rate is the speed of the heartbeat measured by the
number of contractions (beats) of the heart per minute
(bpm).
• The heart rate can vary according to the body's physical
needs, including the need to absorb oxygen.
• The histogram shows that heart disease patients generally have a
maximum heart rate of 160-180 bpm.
• Research shows that the maximum heart rate should neither fall too
low nor rise too high.
Scatterplot
The scatterplot shows a negative relationship
between age and maximum heart rate achieved.
As age increases, the maximum heart rate achieved falls
for a diseased person.
Mixed Corrplot
• There should be no high correlations (multicollinearity)
among the predictors. This can be assessed with a
correlation matrix of the predictors.
• We created a dataset of the numeric variables and
calculated the correlations among them.
• We then used a mixed corrplot to
visualize our calculations.
Since every pairwise correlation between the variables is below 0.90,
these variables can be used in the model.
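The correlation check can be sketched as follows in Python: compute the Pearson correlation for every pair of numeric variables and flag any pair whose absolute value reaches the 0.90 threshold used on the slide. The data values are hypothetical and are not taken from the project's dataset.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical numeric columns, for illustration only.
numeric = {
    "age": [41, 45, 52, 58, 63, 67],
    "max_hr": [172, 150, 178, 165, 160, 140],
    "cholesterol": [204, 250, 212, 240, 199, 233],
}

# One entry per unordered pair of variables.
correlations = {
    (a, b): pearson(numeric[a], numeric[b]) for a, b in combinations(numeric, 2)
}

# Pairs that would be considered too strongly correlated to keep together.
flagged = [pair for pair, r in correlations.items() if abs(r) >= 0.90]

for pair, r in correlations.items():
    print(pair, round(r, 3))
print("flagged:", flagged)
```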
Corrplot
INSIGHTS:
• As age increases, cholesterol, stress of the heart during exercise and resting bp
also increase. On the other hand, maximum heart rate falls with age.
• As cholesterol increases, stress of the heart during exercise and resting bp
increase, while maximum heart rate falls.
• As ST depression rises, i.e. as stress of the heart falls, resting bp rises.
• Resting bp also has a negative relationship with maximum heart rate.
• The degree of correlation is very small between all variables. However, age
and maximum heart rate show a slightly higher correlation.
• ST depression and maximum heart rate show similar results.
Bar chart
We observe that the number of males is
higher than the number of females in our dataset.
We therefore first assumed that the data is biased.
However, we later concluded that a larger proportion of
males going for heart check-ups implies a
higher degree of risk for males.
Creating a model
• We use LOGISTIC REGRESSION because the dependent variable, in our case
target (whether a person suffers from heart disease or not), is categorical in nature.
• 1: person suffers from heart disease
• 0: person does not suffer from heart disease
Division into train data and test data
• We use the caret package to divide the entire dataset into 2 parts:
• Train: this part of the dataset is used for model building. Analysis is done on this
dataset and an appropriate model is built according to requirements.
• Test: this part of the dataset is used to test the model. The model's output for this dataset
is compared with the actual output so that the accuracy of the model can be
estimated.
The function createDataPartition is used to split the data into 60% for the training set
and 40% for the testing set.
We do this to predict the dependent variable target.
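caret's createDataPartition samples within each class of the outcome, so both parts keep roughly the same share of diseased and healthy patients. The following Python sketch imitates that idea for a 60/40 split; the helper name `stratified_split`, the seed and the labels are all hypothetical, and the rounding rule is an assumption rather than caret's exact behaviour.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.6, seed=42):
    """Return (train_indices, test_indices), sampling separately within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train = []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])

    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

# Hypothetical target column: 10 diseased (1) and 10 healthy (0) patients.
labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```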
Variable selection
• Method 1: BACKWARD SELECTION
Backward selection (or backward elimination) starts with all predictors in the model,
iteratively removes the least contributive predictors, and stops when all remaining
predictors are statistically significant.
Result of Backward Elimination
Using backward selection, the following variables are found to be statistically significant in our
analysis:
• Sex
• Chest pain type
• Rest bp
• Ex induced angina
• ST depression
• Slope
• Vessels
• Defect
Model1:
Hence, we build a model from the variables selected through backward elimination.
Checking for multicollinearity
• The variables should not be correlated amongst themselves in
order to have better accuracy.
• We check the variance inflation factor (VIF) in order to check for
multicollinearity.
• If the VIF of a variable is greater than 5, it indicates possible
multicollinearity. Therefore we remove any variable that has a
VIF > 5.
• However, VIF can only be checked for numeric data.
Since none of the variables has a VIF > 5, we use all
variables.
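The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The Python sketch below covers only the simplest case of two predictors, where R² equals the squared Pearson correlation; that restriction and the example data are assumptions made to keep the illustration short.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def vif_from_r_squared(r_squared):
    """Variance inflation factor for a predictor with the given R^2."""
    return 1.0 / (1.0 - r_squared)

def vif_two_predictors(x, y):
    """With exactly two predictors, R^2 of one on the other is r^2."""
    return vif_from_r_squared(pearson(x, y) ** 2)

# Hypothetical predictors, for illustration only.
age = [41, 45, 52, 58, 63, 67]
rest_bp = [120, 135, 118, 140, 128, 150]
print(round(vif_two_predictors(age, rest_bp), 3))
```

The cut-off of 5 used on the slide corresponds to R² = 0.8, since 1 / (1 - 0.8) = 5.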
• Method 2: RANDOM FOREST
Random forest, as its name implies, consists of a large number of individual decision
trees that operate as an ensemble. Each individual tree in the random forest outputs a
class prediction, and the class with the most votes becomes the model's prediction.
• We applied random forest to our entire dataset
in order to get the variable importance plot.
• The variable importance plot shows the mean
decrease in accuracy, which represents how
much the accuracy of the model falls when the
information in each variable is removed (by permuting its values).
• The higher the mean decrease in accuracy or
mean decrease in Gini score, the higher the importance
of the variable in the model.
• Having the lowest mean decrease in accuracy, resting
bp, fasting blood sugar and resting ECG were
eliminated from our model.
VARIABLE IMPORTANCE PLOT
Model 2:
• With the help of the mean decrease in accuracy, we build a
logistic model using:
• age, sex, chest pain type, cholesterol, maximum heart rate,
exercise induced angina, ST depression, slope, vessels and
defect.
Accuracy and ROCR curve
We compare the actual and predicted values of both models and the results are as follows:
Model1:
Accuracy = 85.95%
Area under the curve (AUC) = 87.41%
Model2:
Accuracy = 79.33%
Area under the curve (AUC) = 86.52%
Result
• Accuracy = correct predictions / total predictions
• ROCR gives the area under the ROC curve (AUC). The greater the
area, the more reliable the model.
• Since both the accuracy and the area under the curve of
Model1 are higher, it is the better fit for the analysis.
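The area under the ROC curve equals the probability that a randomly chosen diseased patient receives a higher predicted score than a randomly chosen healthy one. The Python sketch below computes it directly from that definition; the labels and scores are hypothetical, and the project itself obtained its AUC values in R.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions when scores at or above the threshold count as 1."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)

# Hypothetical true classes and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print("AUC:", round(auc(labels, scores), 4))
print("accuracy:", round(accuracy(labels, scores), 4))
```

Accuracy depends on the chosen threshold, while the AUC summarizes the ranking over all thresholds, which is why the two measures can order models differently.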
Confusion matrix
• Accuracy: the proportion of the total number of predictions that were correct.
• Positive Predictive Value or Precision: the proportion of predicted positive cases that were
actually positive.
• Negative Predictive Value: the proportion of predicted negative cases that were actually negative.
• Sensitivity or Recall or True Positive Rate: the proportion of actual positive cases which are
correctly identified.
• Specificity: the proportion of actual negative cases which are correctly identified.
• Precision: d/(c+d)
Accuracy of our model = (43+61)/(43+61+5+12) = 85.95%
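The definitions above can be applied to the four counts from the slide. In this Python sketch only the counts 43, 61, 5 and 12 come from the source; which of them are true negatives, true positives, false positives and false negatives is an assumption made for illustration, since the confusion matrix table itself is not reproduced. Accuracy does not depend on that assignment, but the other four measures do.

```python
# Assumed cell assignment; only the four counts come from the slide.
tn, tp, fp, fn = 43, 61, 5, 12

total = tn + tp + fp + fn

accuracy = (tp + tn) / total      # correct predictions / all predictions
precision = tp / (tp + fp)        # positive predictive value
npv = tn / (tn + fn)              # negative predictive value
sensitivity = tp / (tp + fn)      # recall, true positive rate
specificity = tn / (tn + fp)      # true negative rate

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("negative predictive value", npv),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
]:
    print(f"{name}: {value:.2%}")
```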
Testing the significance of regressors
• Null Hypothesis (H0): the explanatory variable does not affect the dependent variable.
• Alternative Hypothesis (Ha): the explanatory variable affects the dependent variable.
• We check the summary of the model built with the help of backward elimination in
order to check the p-value of each variable.
• If p < 0.05 (5% level of significance), the null hypothesis is rejected.
• The p-values of sex, chest pain type, ST depression and vessels are less than 0.05.
• Therefore, these variables should be included in our final model.
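The p-values in a logistic regression summary usually come from a Wald z-test: the coefficient divided by its standard error is compared with a standard normal distribution. The Python sketch below shows that calculation; the coefficient and standard error are hypothetical, since the model summary itself is not reproduced in the slides.

```python
from statistics import NormalDist

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, using z = estimate / std_error."""
    z = estimate / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical coefficient and standard error, for illustration only.
coef, se = 1.2, 0.4
p = wald_p_value(coef, se)
significant = p < 0.05  # reject the null hypothesis at the 5% level

print(f"z = {coef / se:.2f}, p = {p:.4f}, significant: {significant}")
```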
Model3:
The variables used in our model are based on the highest rated variables as per
the variable importance plot from random forest. Therefore, with the help of
the previous 2 models, we build our final model using the variables sex, chest
pain type, ST depression and vessels.
All 4 variables are highly significant according to their p-values.
Checking heteroscedasticity
We plot the fitted values against the residuals of
the logistic model.
Since the scatterplot is not funnel shaped, this
model is free from heteroscedasticity.
Checking accuracy
Accuracy of the model =
(44+64)/(44+64+2+11) = 89.26%
Area under the curve = 91.55%
Conclusion
• This dataset is old and small by today's standards. However, it has allowed us
to create a simple model and use machine learning explainability tools and
techniques to peek inside.
• At the start, we hypothesised, using research knowledge, that factors such as
cholesterol, age and fasting blood sugar would be major factors in the model.
However, this dataset didn't show that. Instead, the number of major vessels
and aspects of ECG results dominated.
Insights
• Sex: we observe that a larger number of males appear in heart check-ups.
Thus males are more prone to heart disease.
• Chest pain type: chest pain types have a major role in predicting heart disease, as
patients suffering from angina pains have a higher probability of suffering from
heart disease. Typical angina pain is a major indicator of heart disease.
• ST depression: the stress of the heart during exercise causes a higher risk of heart
disease. Thus the higher the ST depression, the lower the risk.
• Vessels: vessels refers to the fluoroscopy test done as an indicator of blood flow
through the vessels. The darker the color in the fluoroscopy, the lower the risk of heart
disease.
THANK YOU!

More Related Content

What's hot

Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
hktripathy
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
amiteshg
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction system
SWAMI06
 
Data Analysis
Data AnalysisData Analysis
Pca(principal components analysis)
Pca(principal components analysis)Pca(principal components analysis)
Pca(principal components analysis)
kalung0313
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm
Kedar Damkondwar
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
Ricardo Wendell Rodrigues da Silveira
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
Paras Kohli
 
Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
Chathurangi Shyalika
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
Farah M. Altufaili
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis
mlong24
 
Exploring Data
Exploring DataExploring Data
Exploring Data
Datamining Tools
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)
Mostafa G. M. Mostafa
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Salah Amean
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
Eva Durall
 
Machine Learning for Disease Prediction
Machine Learning for Disease PredictionMachine Learning for Disease Prediction
Machine Learning for Disease Prediction
Mustafa Oğuz
 
EDA-Unit 1.pdf
EDA-Unit 1.pdfEDA-Unit 1.pdf
EDA-Unit 1.pdf
Nirmalavenkatachalam
 
Bayesian intro
Bayesian introBayesian intro
Bayesian intro
BayesLaplace1
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
SUJIT SHIBAPRASAD MAITY
 
Prediction of cardiovascular disease with machine learning
Prediction of cardiovascular disease with machine learningPrediction of cardiovascular disease with machine learning
Prediction of cardiovascular disease with machine learning
Pravinkumar Landge
 

What's hot (20)

Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHMHEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
HEART DISEASE PREDICTION USING NAIVE BAYES ALGORITHM
 
Heart disease prediction system
Heart disease prediction systemHeart disease prediction system
Heart disease prediction system
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Pca(principal components analysis)
Pca(principal components analysis)Pca(principal components analysis)
Pca(principal components analysis)
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis
 
Exploring Data
Exploring DataExploring Data
Exploring Data
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Machine Learning for Disease Prediction
Machine Learning for Disease PredictionMachine Learning for Disease Prediction
Machine Learning for Disease Prediction
 
EDA-Unit 1.pdf
EDA-Unit 1.pdfEDA-Unit 1.pdf
EDA-Unit 1.pdf
 
Bayesian intro
Bayesian introBayesian intro
Bayesian intro
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
 
Prediction of cardiovascular disease with machine learning
Prediction of cardiovascular disease with machine learningPrediction of cardiovascular disease with machine learning
Prediction of cardiovascular disease with machine learning
 

Similar to Project ppt

Heart attack possibility.pptx
Heart attack possibility.pptxHeart attack possibility.pptx
Heart attack possibility.pptx
PavithraAbeysiriward
 
Heart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning AnalysisHeart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning Analysis
Boston Institute of Analytics
 
Heart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning AnalysisHeart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning Analysis
Boston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
Boston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
Boston Institute of Analytics
 
Heart Stats(4)-1
Heart Stats(4)-1Heart Stats(4)-1
Heart Stats(4)-1
Lei Barr
 
Heart Disease Prediction Analysis - Sushil Gupta.pptx
Heart Disease Prediction Analysis - Sushil Gupta.pptxHeart Disease Prediction Analysis - Sushil Gupta.pptx
Heart Disease Prediction Analysis - Sushil Gupta.pptx
Boston Institute of Analytics
 
biostat 2.pptx h h jbjbivigyfyfyfyfyftftc
biostat 2.pptx h h jbjbivigyfyfyfyfyftftcbiostat 2.pptx h h jbjbivigyfyfyfyfyftftc
biostat 2.pptx h h jbjbivigyfyfyfyfyftftc
MrMedicine
 
Tom Nguyen - SAS Project
Tom Nguyen - SAS ProjectTom Nguyen - SAS Project
Tom Nguyen - SAS Project
Tom Nguyen
 
CARDIO docx
CARDIO docxCARDIO docx
CARDIO docx
JamilaBullan
 
Hypertention ppt
Hypertention pptHypertention ppt
Hypertention ppt
Neelam Yadav
 
Pro / Con Debate on Central Blood Pressure
Pro / Con Debate on Central Blood PressurePro / Con Debate on Central Blood Pressure
Pro / Con Debate on Central Blood Pressure
magdy elmasry
 
A Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeA Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision Tree
IOSR Journals
 
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCEPREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
ijaia
 
Vital signs - Blood pressure Monitoring
Vital signs - Blood pressure Monitoring Vital signs - Blood pressure Monitoring
Vital signs - Blood pressure Monitoring
Arsi University, Asella, Ethiopia
 
Pulse-Dynamics_e-Book (2013)
Pulse-Dynamics_e-Book (2013)Pulse-Dynamics_e-Book (2013)
Pulse-Dynamics_e-Book (2013)
Gordon Hsu
 
Essay On Heart Failure
Essay On Heart FailureEssay On Heart Failure
Essay On Heart Failure
Heidi Owens
 
Major Cardiac Circuits
Major Cardiac CircuitsMajor Cardiac Circuits
Major Cardiac Circuits
Felicia Barker
 
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
ijdmtaiir
 
Stress echo and aortic stenosis
Stress echo and aortic stenosisStress echo and aortic stenosis
Stress echo and aortic stenosis
Cardiovascular Diagnosis and Therapy (CDT)
 

Similar to Project ppt (20)

Heart attack possibility.pptx
Heart attack possibility.pptxHeart attack possibility.pptx
Heart attack possibility.pptx
 
Heart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning AnalysisHeart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning Analysis
 
Heart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning AnalysisHeart Disease Classification: Machine Learning Analysis
Heart Disease Classification: Machine Learning Analysis
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Heart Stats(4)-1
Heart Stats(4)-1Heart Stats(4)-1
Heart Stats(4)-1
 
Heart Disease Prediction Analysis - Sushil Gupta.pptx
Heart Disease Prediction Analysis - Sushil Gupta.pptxHeart Disease Prediction Analysis - Sushil Gupta.pptx
Heart Disease Prediction Analysis - Sushil Gupta.pptx
 
biostat 2.pptx h h jbjbivigyfyfyfyfyftftc
biostat 2.pptx h h jbjbivigyfyfyfyfyftftcbiostat 2.pptx h h jbjbivigyfyfyfyfyftftc
biostat 2.pptx h h jbjbivigyfyfyfyfyftftc
 
Tom Nguyen - SAS Project
Tom Nguyen - SAS ProjectTom Nguyen - SAS Project
Tom Nguyen - SAS Project
 
CARDIO docx
CARDIO docxCARDIO docx
CARDIO docx
 
Hypertention ppt
Hypertention pptHypertention ppt
Hypertention ppt
 
Pro / Con Debate on Central Blood Pressure
Pro / Con Debate on Central Blood PressurePro / Con Debate on Central Blood Pressure
Pro / Con Debate on Central Blood Pressure
 
A Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision TreeA Heart Disease Prediction Model using Decision Tree
A Heart Disease Prediction Model using Decision Tree
 
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCEPREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
 
Vital signs - Blood pressure Monitoring
Vital signs - Blood pressure Monitoring Vital signs - Blood pressure Monitoring
Vital signs - Blood pressure Monitoring
 
Pulse-Dynamics_e-Book (2013)
Pulse-Dynamics_e-Book (2013)Pulse-Dynamics_e-Book (2013)
Pulse-Dynamics_e-Book (2013)
 
Essay On Heart Failure
Essay On Heart FailureEssay On Heart Failure
Essay On Heart Failure
 
Major Cardiac Circuits
Major Cardiac CircuitsMajor Cardiac Circuits
Major Cardiac Circuits
 
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
 
Stress echo and aortic stenosis
Stress echo and aortic stenosisStress echo and aortic stenosis
Stress echo and aortic stenosis
 


Project ppt

  • 1. R PROGRAMMING PROJECT LOGISTIC REGRESSION ANALYSIS TO DETECT THE PRESENCE OF HEART DISEASES
  • 2. CONTRIBUTION Kritika Jain and Vanya Vasudeva: coding, presentation and research Vansh Puri and Anant Goyal: data cleaning
  • 3. Introduction Heart disease is one of the leading causes of death, accounting for 17.7 million deaths each year, 31% of all global deaths, as reported by the World Health Organization in 2017. Several clinical measurements and symptoms are found to be related to heart disease, including age, blood pressure, total cholesterol, diabetes and hypertension. The heart disease dataset consists of these attributes, collected and summarized from patients. With the large amounts of data made available in recent years, heart disease can be diagnosed automatically, using traditional statistical methods to predict each patient's likelihood of having it.
  • 4. Aim of analysis
      - The aim of analyzing this data set is to predict which people suffer from heart disease, based on certain common factors.
      - Since the prediction has to be in a yes or no format, the dependent variable is categorical and can thus be modelled using logistic regression or random forest.
  • 5. EXPLORATORY DATA ANALYSIS Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. It employs a variety of techniques to:
      1. maximize insight into a data set;
      2. uncover underlying structure;
      3. extract important variables;
      4. detect outliers and anomalies;
      5. test underlying assumptions;
      6. develop parsimonious models; and
      7. determine optimal factor settings.
  • 6. Components of EDA The main components of exploring data:
      1. Understanding your variables
          - Importing the dataset
          - Changing the structure
          - Summary statistics: calculating mean, median, mode, skewness, kurtosis, variance, etc.
      2. Cleaning your dataset
          - Removing redundant variables (NA values)
          - Variable selection
          - Removing outliers: using boxplots
      3. Analysing the variables
          - Visualising data: histograms, bar plots, scatterplots, etc.
          - Checking for correlation: using corrplots
          - Creating a model
          - Using logistic regression to make predictions
          - Checking the accuracy of the model
  • 7. What is logistic regression analysis? Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary) or categorical. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
  • 8. Major assumptions in binary logistic regression
      - The dependent variable should be dichotomous in nature (e.g., presence vs. absence).
      - There should be no outliers in the data.
      - OLS assumptions should be satisfied.
      - At the center of logistic regression analysis is the task of estimating the log odds of an event. Mathematically, logistic regression estimates a multiple linear regression function defined as: $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$ for $i = 1, \dots, n$.
    Regression coefficients explain the change in log(odds) of the response for a unit change in a predictor. However, since the relationship between p(X) and X is not a straight line, a unit change in an input feature does not change the predicted probability by a fixed amount; instead it multiplies the odds by a fixed factor, the odds ratio.
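The log-odds relationship on this slide can be illustrated with a minimal R sketch. It uses simulated data rather than the heart dataset; the names `x` and `y` and the true coefficients 0.5 and 1.2 are assumptions chosen only for illustration.

```r
# Simulated data: one predictor x and a binary response y (illustrative only)
set.seed(1)
n <- 500
x <- rnorm(n)
p <- plogis(0.5 + 1.2 * x)          # inverse logit of the linear predictor
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)

# Coefficients live on the log-odds scale; exponentiating gives an odds ratio
log_odds_change <- coef(fit)["x"]
odds_ratio <- exp(log_odds_change)

# The fitted log odds are linear in x, while the fitted probabilities are not
eta  <- predict(fit, type = "link")      # log odds
prob <- predict(fit, type = "response")  # probabilities
all.equal(unname(eta), unname(log(prob / (1 - prob))))
```

Here `odds_ratio` is the factor by which the odds of `y = 1` are multiplied when `x` increases by one unit.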
  • 9. Understanding the variables
      - age - age in years
      - sex - (1 = male; 0 = female)
      - cp - chest pain type
          0: Typical angina: chest pain related to decreased blood supply to the heart
          1: Atypical angina: chest pain not related to the heart
          2: Non-anginal pain: typically esophageal spasms; not heart related
          3: Asymptomatic: chest pain not showing signs of disease
      - trestbps - resting blood pressure (in mm Hg on admission to the hospital); above 130-140 is cause for concern
      - chol - serum cholesterol in mg/dl; above 200 is cause for concern
  • 10.
      - fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false); a value above 126 mg/dl signals diabetes
      - restecg - resting electrocardiographic results
          0: Nothing to note
          1: Can range from mild symptoms to severe problems; signals a non-normal heart beat
          2: Enlarged main pumping chamber of the heart
      - thalach - maximum heart rate achieved
      - exang - exercise induced angina (1 = yes; 0 = no)
      - oldpeak - stress of the heart during exercise; an unhealthy heart stresses more
  • 11.
      - slope - the slope of the peak exercise ST segment
          0: Upsloping: better heart rate with exercise
          1: Flat sloping: typical healthy heart
          2: Downsloping: signs of an unhealthy heart
      - ca - number of major vessels (0-3) colored by fluoroscopy; a colored vessel means the doctor can see the blood passing through; the more the blood movement, the better the functioning of the heart (no clots)
      - thal - thallium stress result
          0, 1: normal
          2: fixed defect
          3: reversible defect: no proper blood movement when exercising
      - target - has disease or not (1 = yes, 0 = no); the predicted attribute
  • 12. Changes made to variables For simplification, we changed the variable names:
      - cp = chest_pain_type
      - trestbps = rest_bp
      - chol = cholesterol
      - fbs = fast_bs
      - restecg = rest_ecg
      - thalach = max_hr
      - exang = ex_induced_angina
      - oldpeak = ST_dep
      - ca = vessels
      - thal = defect
  • 13. Other changes to data
      - The following variables were converted into factors to define their context accurately: slope, vessels, defect, target, sex, chest_pain_type, fast_bs, rest_ecg, ex_induced_angina.
      - The column age was converted into an integer.
      - No NA or missing values were found in the data.
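The renaming and type changes from slides 12 and 13 could be sketched in base R as follows. The three-row data frame is made up for illustration and is not the real dataset; the object names `heart`, `new_names` and `factor_cols` are assumptions.

```r
# Tiny made-up data frame using the original column names (not real patients)
heart <- data.frame(
  age = c(50, 61, 45), sex = c(1, 0, 1), cp = c(0, 2, 1),
  trestbps = c(120, 140, 130), chol = c(210, 260, 190), fbs = c(0, 1, 0),
  restecg = c(0, 1, 0), thalach = c(160, 140, 175), exang = c(0, 1, 0),
  oldpeak = c(0.5, 2.0, 0.0), slope = c(2, 1, 2), ca = c(0, 2, 0),
  thal = c(2, 3, 2), target = c(1, 0, 1)
)

# Old name -> new name, as listed on slide 12
new_names <- c(cp = "chest_pain_type", trestbps = "rest_bp", chol = "cholesterol",
               fbs = "fast_bs", restecg = "rest_ecg", thalach = "max_hr",
               exang = "ex_induced_angina", oldpeak = "ST_dep",
               ca = "vessels", thal = "defect")
names(heart)[match(names(new_names), names(heart))] <- unname(new_names)

# Categorical columns become factors; age becomes an integer (slide 13)
factor_cols <- c("slope", "vessels", "defect", "target", "sex",
                 "chest_pain_type", "fast_bs", "rest_ecg", "ex_induced_angina")
heart[factor_cols] <- lapply(heart[factor_cols], factor)
heart$age <- as.integer(heart$age)

anyNA(heart)   # quick check for missing values
str(heart)     # inspect the resulting column classes
```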
  • 14. Summary statistics of data Summary statistics give a quick description of the data, including the mean, median, mode, skewness, kurtosis, variance and standard deviation of the variables. Some of these measures apply only to variables that are numeric in nature.
  • 15. Removing outliers To remove the outliers, we use boxplots. We plot the data on a boxplot and then replace the outliers with the mean, median or mode. Figure: maximum heart rate (before and after).
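A minimal sketch of the boxplot-based replacement described on this slide, assuming the median is used as the replacement value (the slide allows mean, median or mode). The `max_hr` values below are invented for illustration.

```r
# Illustrative maximum heart rates with one obvious low outlier (made-up values)
max_hr <- c(150, 162, 171, 158, 165, 149, 172, 160, 71, 155)

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges,
# the same rule boxplot() uses to draw individual outlier points
out_vals <- boxplot.stats(max_hr)$out

# Replace each flagged value with the median of the column
cleaned <- max_hr
cleaned[cleaned %in% out_vals] <- median(max_hr)
```

Calling `boxplot(max_hr, cleaned, names = c("before", "after"))` would draw a before-and-after comparison of the kind the slide shows.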
  • 19. Data visualizations
      - Data visualization refers to plotting the data in histograms, bar graphs, scatterplots, etc. for easier and more efficient understanding of the data.
      - A bar chart or bar graph presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent.
      - A histogram is a graphical display of data using bars of different heights. It is similar to a bar chart, but a histogram groups numbers into ranges. The height of each bar shows how many values fall into each range.
      - A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the dataset is plotted as a point whose (x, y) coordinates correspond to its values for the two variables.
  • 20. Histograms To make the histograms, we filtered the data using the dplyr package to get the data of people suffering from heart disease. We then used the hist function to create the histograms. The first histogram shows the number of people of a particular age suffering from heart disease. As you age, so do your blood vessels: they become less flexible, making it harder for blood to move through them easily. We observe that people in the age group of 40-65 are most likely to suffer from heart disease.
  • 21.
      - The histogram shows the frequency of ST depression values among patients suffering from heart disease.
      - ST depression shows the stress of the heart induced by exercise.
      - The lower the ST depression, the greater the stress of the heart.
      - Heart patients experience relatively higher stress of the heart.
  • 22. The histogram shows that most people with heart disease have a resting bp between 120 and 140 mm Hg. Very high and very low blood pressure are both causes of concern for heart patients. The graph shows that heart patients tend to have a high bp. A normal blood pressure reading is about 120/80 mm Hg.
  • 23.
      - The histogram shows that most people suffering from heart disease have cholesterol levels between 200 and 240 mg/dl.
      - According to research, a cholesterol level above 200 is a cause for concern.
  • 24.
      - Heart rate is the speed of the heartbeat measured by the number of contractions (beats) of the heart per minute (bpm).
      - The heart rate can vary according to the body's physical needs, including the need to absorb oxygen.
      - The histogram shows that heart disease patients generally have a maximum heart rate of 160-180 bpm.
      - Research shows that the maximum heart rate should neither fall too low nor rise too high.
  • 25. Scatterplot The scatterplot shows a downward relation between age and maximum heart rate achieved. As age increases, maximum heart rate achieved falls for a diseased person.
  • 26. Mixed Corrplot
      - There should be no high correlations (multicollinearity) among the predictors. This can be assessed with a correlation matrix of the predictors.
      - We created a data set of the numeric variables and calculated the correlations among them.
      - We then used a mixed corrplot to visualize the results. Since every pairwise correlation is below 0.90, these variables can be used in the model.
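The correlation check can be sketched in base R. The data below are simulated, not the heart dataset; the column names follow the renamed variables from slide 12, and the means and spreads are illustrative assumptions.

```r
# Simulated numeric predictors (illustrative values only)
set.seed(42)
n <- 200
age <- round(rnorm(n, mean = 54, sd = 9))
num <- data.frame(
  age         = age,
  rest_bp     = rnorm(n, 130, 17) + 0.3 * (age - 54),
  cholesterol = rnorm(n, 245, 50),
  max_hr      = rnorm(n, 150, 20) - 0.8 * (age - 54),
  ST_dep      = abs(rnorm(n, 1, 1))
)

corr <- cor(num)   # Pearson correlation matrix

# Largest absolute correlation between two different variables
max_abs <- max(abs(corr[upper.tri(corr)]))

# Pairs that would breach the 0.90 threshold used on the slide
high_pairs <- which(abs(corr) > 0.90 & upper.tri(corr), arr.ind = TRUE)
```

If the corrplot package is installed, `corrplot::corrplot.mixed(corr)` should produce a mixed plot similar to the one the slide describes.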
  • 27. Corrplot INSIGHTS:
      - As age increases, cholesterol, stress of the heart during exercise and resting bp also increase. On the other hand, maximum heart rate falls with old age.
      - As cholesterol increases, stress of the heart during exercise and resting bp increase, while maximum heart rate falls.
      - As ST depression rises, i.e. stress of the heart falls, resting bp rises.
      - Resting bp also has a negative relation with maximum heart rate.
      - The degree of correlation is very small between all variables. However, age and maximum heart rate show a slightly higher correlation.
      - ST depression and maximum heart rate show similar results.
  • 28. Bar chart We observe that the number of males is higher than the number of females in our dataset. We therefore first assumed that the data is biased. However, we later realized that a larger proportion of the male population going for heart check-ups implies a higher degree of risk for males.
  • 29. Creating a model We use logistic regression because the dependent variable, in our case target (whether a person suffers from heart disease or not), is categorical in nature.
      - 1: person suffers from heart disease
      - 0: person does not suffer from heart disease
  • 30. Division into train data and test data We use the caret package to divide the entire dataset into 2 parts:
      - Train: this part of the dataset is used for model building. Analysis is done on this dataset and an appropriate model is built according to requirements.
      - Test: this part of the dataset is used to test the model. The model's predictions for this dataset are compared with the actual outcomes, which gives an estimate of the model's accuracy.
    The function createDataPartition is used to split the data into 60% for the training set and 40% for the testing set. We do this to predict the dependent variable target.
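A 60/40 split that preserves the class balance can be sketched in base R as below. This is a stand-in for the caret call, not the authors' code; the data frame `df` with 45 negatives and 55 positives is hypothetical.

```r
# Hypothetical data: 100 rows with a binary target (counts are illustrative)
set.seed(123)
df <- data.frame(id = 1:100,
                 target = factor(rep(c(0, 1), times = c(45, 55))))

# Sample 60% of the row indices within each class so that the class
# proportions in the training set match those of the full data
train_idx <- unlist(lapply(split(seq_len(nrow(df)), df$target), function(rows) {
  rows[sample.int(length(rows), size = round(0.6 * length(rows)))]
}))

train <- df[train_idx, ]
test  <- df[-train_idx, ]
```

With caret installed, `createDataPartition(df$target, p = 0.6, list = FALSE)` returns comparable stratified training indices.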
  • 31. Variable selection Method 1: BACKWARD SELECTION Backward selection (or backward elimination) starts with all predictors in the model, iteratively removes the least contributive predictors, and stops when all remaining predictors are statistically significant. Using backward selection, the following variables are found to be statistically significant in our analysis:
      - Sex
      - Chest pain type
      - Rest bp
      - Ex induced angina
      - ST depression
      - Slope
      - Vessels
      - Defect
    Hence, we build a model from the variables selected through backward elimination.
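Backward elimination can be sketched with base R's `step()`. Note one assumption: `step()` drops terms by AIC rather than by p-value, so it is a close relative of the procedure described on the slide and not necessarily the exact method the authors used. The data and the names `x1`, `x2` and `noise` are simulated.

```r
# Simulated data: x1 and x2 drive the outcome, noise does not (illustrative)
set.seed(7)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.5 * d$x1 - 1.0 * d$x2))

# Start from the full model and let step() remove terms while AIC improves
full    <- glm(y ~ x1 + x2 + noise, data = d, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)

# Predictors that survived the elimination
kept <- attr(terms(reduced), "term.labels")
```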
  • 32. Result of Backward Elimination Model1:
  • 33. Checking for multicollinearity
      - The variables should not be correlated amongst themselves, in order to have better accuracy.
      - We check the variance inflation factor (VIF) in order to check for multicollinearity.
      - A VIF greater than 5 indicates possible multicollinearity. Therefore, we remove any variable that has a VIF > 5.
      - However, VIF can only be checked for numeric data.
  • 34. Since none of the variables has a VIF greater than 5, we keep all of them.
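The VIF rule can be illustrated by computing the statistic by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The predictors `a`, `b` and `c` are simulated, and `b` is deliberately built as a near-copy of `a` so that the VIF > 5 rule fires, unlike in the authors' data.

```r
# Simulated predictors: b is almost a copy of a, c is independent (illustrative)
set.seed(11)
n <- 200
a <- rnorm(n)
preds <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n))

# VIF of each predictor = 1 / (1 - R^2) from regressing it on all the others
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})

# Predictors that breach the VIF > 5 threshold from the slide
to_drop <- names(vif_manual)[vif_manual > 5]
```

If the car package is available, `car::vif()` applied to a fitted model reports the same kind of statistic.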
  • 35. Method 2: RANDOM FOREST Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest produces a class prediction, and the class with the most votes becomes the model's prediction.
  • 36. VARIABLE IMPORTANCE PLOT
      - We applied random forest to our entire dataset in order to get the variable importance plot.
      - The variable importance plot shows the mean decrease in accuracy, which represents how much removing each variable reduces the accuracy of the model.
      - The higher the mean decrease in accuracy or mean decrease in Gini score, the higher the importance of the variable in the model.
      - Having the lowest mean decrease in accuracy, resting bp, fasting blood sugar and resting ecg were eliminated from our model.
  • 37. Model 2: With the help of the mean decrease in accuracy, we build a logistic model using age, sex, chest pain type, cholesterol, maximum heart rate, exercise induced angina, ST depression, slope, vessels and defect.
  • 38. Accuracy and ROC curve We compare the actual and predicted values of both models and the results are as follows:
      - Model 1: Accuracy = 85.95%, area under the curve (AUC) = 87.41%
      - Model 2: Accuracy = 79.33%, area under the curve (AUC) = 86.52%
  • 39. Result
      - Accuracy = correct predictions / total predictions.
      - The ROC curve is summarised by the area under the curve (AUC). The greater the area, the more reliable the model.
      - Since both the accuracy and the area under the curve of Model 1 are higher, it is the better fit for the analysis.
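Accuracy and AUC can both be computed by hand, as in the sketch below. The `actual` labels and `score` probabilities are invented for illustration; the helper `auc()` is a hypothetical name and uses the rank-sum (Mann-Whitney) identity, which equals the area under the ROC curve.

```r
# Made-up labels and predicted probabilities (illustrative only)
actual <- c(0, 0, 1, 1, 0, 1, 1, 0)
score  <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.70, 0.60)

# Accuracy at a 0.5 cut-off: share of predictions that match the labels
pred <- as.integer(score > 0.5)
accuracy <- mean(pred == actual)

# AUC via ranks: the probability that a random positive case receives
# a higher score than a random negative case
auc <- function(actual, score) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_value <- auc(actual, score)
```

Unlike accuracy, the AUC does not depend on the choice of cut-off, which is why the slides report both.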
  • 40. Confusion matrix
      - Accuracy: the proportion of the total number of predictions that were correct.
      - Positive Predictive Value or Precision: the proportion of predicted positive cases that are actually positive.
      - Negative Predictive Value: the proportion of predicted negative cases that are actually negative.
      - Sensitivity or Recall or True Positive Rate: the proportion of actual positive cases that are correctly identified.
      - Specificity: the proportion of actual negative cases that are correctly identified.
      - Precision: d/(c + d), where c and d are cells of the confusion matrix.
  • 41. Accuracy of our model = (43 + 61)/(43 + 61 + 5 + 12) = 104/121 = 85.95%
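The definitions from slide 40 can be applied to the four counts on slide 41. The accuracy does not depend on how the cells are labelled, but the other metrics do: the slide does not say which count is which, so the assignment below (61 true positives, 43 true negatives, 5 false positives, 12 false negatives) is an assumption made only to show the formulas.

```r
# Counts from slide 41. ASSUMPTION: which count is TP/TN and which is FP/FN
# is not stated on the slide; this labelling is for illustration only.
tp <- 61; tn <- 43
fp <- 5;  fn <- 12

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # independent of the labelling
precision   <- tp / (tp + fp)                   # positive predictive value
npv         <- tn / (tn + fn)                   # negative predictive value
sensitivity <- tp / (tp + fn)                   # recall, true positive rate
specificity <- tn / (tn + fp)                   # true negative rate

round(100 * c(accuracy = accuracy, precision = precision, npv = npv,
              sensitivity = sensitivity, specificity = specificity), 2)
```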
  • 42. Testing the significance of regressors
      - Null hypothesis (H0): the explanatory variable does not affect the dependent variable.
      - Alternative hypothesis (Ha): the explanatory variable affects the dependent variable.
      - We check the summary of the model built with backward elimination in order to read off the p-value of each variable.
      - If p < 0.05 (5% level of significance), the null hypothesis is rejected.
      - The p-values of sex, chest pain type, ST depression and vessels are less than 0.05.
      - Therefore, these variables should be included in our final model.
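Reading p-values off a fitted logistic model can be sketched as follows. The data are simulated, with `x1` given a real effect and `x2` none; these names and the effect size are illustrative assumptions.

```r
# Simulated data: x1 has a true effect on y, x2 has none (illustrative)
set.seed(3)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(1.5 * d$x1))

fit <- glm(y ~ x1 + x2, data = d, family = binomial)

# The coefficient table holds estimates, standard errors, z values
# and the p-values of the Wald tests of H0: coefficient = 0
coefs    <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|z|)"]

# Regressors for which H0 is rejected at the 5% level
is_regressor <- names(p_values) != "(Intercept)"
significant  <- names(p_values)[is_regressor & p_values < 0.05]
```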
  • 43. Model 3: The variables used in our model are based on the highest rated variables in the variable importance plot from random forest. Therefore, with the help of the previous 2 models, we build our final model with the variables sex, chest pain type, ST depression and vessels. All 4 variables are highly significant according to their p-values.
  • 44.
  • 45. Checking heteroscedasticity We plot the fitted values against the residuals of the logistic model. Since the scatterplot is not funnel shaped, this model is free from heteroscedasticity.
  • 46. Checking accuracy Accuracy of the model = (44 + 64)/(44 + 64 + 2 + 11) = 108/121 = 89.26%. Area under the curve = 91.55%.
  • 47. Conclusion
      - This dataset is old and small by today's standards. However, it has allowed us to create a simple model and use machine learning explainability tools and techniques to peek inside.
      - At the start, we hypothesised from prior research that cholesterol, age and fasting blood sugar would be major factors in the model. However, this dataset did not show that. Instead, the number of major vessels and aspects of the ECG results dominated.
  • 48. Insights
      - Sex: we observe that a larger number of males appear in heart check-ups. Thus males are more prone to heart disease.
      - Chest pain type: chest pain type plays a major role in predicting heart disease, as patients suffering from angina pain have a higher probability of suffering from heart disease. Typical angina pain is a major indicator of heart disease.
      - ST depression: the stress of the heart during exercise raises the risk of heart disease. Thus, the higher the ST depression, the lower the risk.
      - Vessels: vessels refers to the fluoroscopy test done as an indicator of blood flow through the vessels. The darker the color in fluoroscopy, the lower the risk of heart disease.