Heart Disease
Prediction Analysis
“In the realm of healthcare advancement, this project focusses on the
strategic refinement of prediction algorithms utilizing machine learning
techniques. By sharpening our focus on heart disease prediction, we aim
to pioneer advancements that will redefine the early intervention
strategies and contribute to the overall improvement of the patient
health outcomes.”
Presented by: Sushil Gupta
01 Introduction
   Reason / Issues causing Heart Diseases
   Models and early detection bring essential benefits
02 Data Gathering / Data Refinement
   Comprehensive Dataset Overview
   Refinement and Data Preparation Steps
03 Exploratory Data Analysis
   Find patterns and identify trends
   Gain valuable insights from the dataset
04 Feature Extraction
   Divide Features into X and Y
05 Machine Learning / Deep Learning Methodology
   Algorithms or Models Used
   Compare and Evaluate Each Algorithm
06 Experimental Results
   Visuals of Compared Models
   Accuracy, confusion matrix of models
07 Conclusion
   Importance of the used approach
   Suggestions for the future
08 Appendices and References
   Links and codes used in the project
INTRODUCTION
• Cardiovascular diseases (CVDs) are a group of disorders affecting the heart and blood vessels, and heart failure is a common event caused by CVDs. Early detection and data science models in medicine can help doctors foresee health issues before they become serious.
• This dataset provides a comprehensive view of patient health indicators in a medical context, focusing on factors related to heart disease.
• The dataset encompasses 13 vital features, such as age, blood pressure, cholesterol, fasting blood sugar, chest pain and more.
• The target variable, 'Heart Disease Presence', indicates whether an individual has heart disease (1) or not (0).
17.9m: Fatalities caused by CVD (each year)
33%: Global deaths
Risk factors: Hypertension, Smoking, Obesity, Sedentary lifestyle, Hyperlipidaemia, Alcohol
Value of the Study / Objective of the Study
• Early detection and prevention
• Real-time monitoring
• Improved patient outcomes
• Public health advancement
• Comparing algorithms
• Risk stratification
WORKFLOW
Step 01: Data Collection. Gathering and organizing data to train machine learning models.
Step 02: Exploratory Data Analysis. Analyzing and visualizing data patterns to understand its characteristics.
Step 03: Preprocessing. Preparing and cleaning data to enhance its quality and suitability.
Step 04: Split Train and Test Data. Dividing the dataset into training and testing sets to evaluate the model.
Step 05: Model Selection and Model Training. Choosing a suitable machine learning algorithm and optimizing its parameters.
Step 06: Model Evaluation. Assessing the performance of a machine learning model using metrics.
Step 07: Monitor and Update. Continuously monitoring the model's performance in the real-world scenario.
Dataset: Heart_1
Columns: 13
Rows: 1025

Features (Columns)
X, Demographic Information: Age, Sex
X, Clinical Data: Chest Pain Type, Cholesterol, Fasting BS, Resting ECG, Max HR, Exercise Angina, Oldpeak, ST Slope
Y, Outcome Variable: Heart Disease (1: Heart Disease, 0: Normal)
EXPLORATORY DATA ANALYSIS
• Exploratory Data Analysis (EDA) helped us understand the data structure, find patterns, identify trends, and gain valuable insights from the dataset.
• Through EDA we analyzed the distribution of each feature and checked the correlation between the features.
• Visuals helped us see the data clearly, understand how the clinical variables relate to each other, and pinpoint the factors that play a vital role in predicting heart disease.
• The dataset has no null values, but it does contain some duplicate rows and outliers.
• Nine columns are numerical by data type but categorical in meaning, so we converted them to the object type for proper analysis.
• We used univariate and bivariate analysis to gain insight into the individual characteristics of the data and into how each feature relates to the main goal: predicting the target variable.
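The data-quality checks described above can be sketched with pandas. This is a minimal illustration, not the project's notebook: the tiny DataFrame is made-up sample data, and the list of columns treated as categorical is an assumption based on the feature names used in the slides.

```python
import pandas as pd

# Tiny hypothetical sample using column names from the slides (not the real data).
df = pd.DataFrame({
    "age": [52, 53, 70, 52],
    "sex": [1, 0, 1, 1],
    "cp": [0, 2, 1, 0],
    "chol": [212, 203, 174, 212],
    "target": [0, 1, 0, 0],
})

# Count missing values per column and fully duplicated rows.
null_counts = df.isnull().sum()
duplicate_rows = int(df.duplicated().sum())

# Integer-coded categories are converted to object so that they are
# summarized as categories rather than as numbers.
# Assumption: this column list is illustrative, not the project's exact list.
categorical_cols = ["sex", "cp", "target"]
df = df.astype({col: "object" for col in categorical_cols})

print(null_counts)
print("duplicate rows:", duplicate_rows)
print(df.dtypes)
```

On the real dataset, the same three checks would be run on the full table before any plotting.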
[Chart: Heart Patient. Heart Presence 51%, Not Present 49%]
[Chart: Age Group vs Target. Categories: Young Age, Middle Age, Old Age; values shown: 6%, 64%, 30%]
DISTRIBUTION OF CONTINUOUS VARIABLE
• Age: Fairly even spread with a peak in the late 50s; the mean age is approx. 54.43 years.
• Resting Blood Pressure: Concentrated around 120-140 mm Hg; the mean is approx. 131.61 mm Hg.
• Cholesterol: Most values fall between 200 and 280 mg/dl.
• Thalach (Maximum Heart Rate): The majority of individuals achieve a heart rate between 140 and 170 bpm during a stress test.
• Oldpeak (ST Depression Induced by Exercise): Mostly concentrated towards 0, indicating that many individuals did not experience significant ST depression.

DISTRIBUTION OF CATEGORICAL VARIABLE
• Cp (Chest Pain): Typical Angina appears to be the most prevalent type.
• Fbs (Fasting Blood Sugar): The majority of patients have an Fbs below 120 mg/dl, indicating that high Fbs is not a common condition in this dataset.
• Exang (Exercise-Induced Angina): The majority of patients do not experience exang, suggesting it is not a common symptom among the patients in this dataset.
• Slope (Slope of the Peak Exercise ST Segment): Types 1 (Flat) and 2 (Downsloping) are the most common.
• Thal (Thallium Stress Test Result): Reversible Defect (type 2) appears more prevalent than the other types.
• Ca (No. of Major Vessels Colored by Fluoroscopy): Most patients have few major vessels colored, with 0 being the most frequent value.
CONTINUOUS FEATURE vs TARGET
• Age: Patients with heart disease are a bit younger on average than those without.
• Trestbps (Resting BP): The distributions are nearly identical, indicating limited differentiating power for this feature.
• Chol (Serum Cholesterol): The distributions for both categories are quite close, but the mean for patients with heart disease is slightly lower.
• Thalach (Max. Heart Rate Achieved): Noticeable difference in distributions. Patients with heart disease tend to achieve a higher maximum heart rate.
• Oldpeak (ST Depression): Lower for patients with heart disease, whose distribution clusters near 0, whereas the non-disease group has a wider spread.
• Thalach appears to have the greatest impact, followed by Oldpeak (ST Depression) and Age.
CATEGORICAL FEATURE vs TARGET
• Ca (No. of Major Vessels): The proportion of heart disease falls as ca increases, except for the highest ca value. Ca = 0 has the highest proportion of heart disease.
• Cp (Chest Pain): Types 1, 2 and 3 have a higher proportion of heart disease than type 0.
• Exang (Exercise-Induced Angina): Patients who did not experience exang (0) show a higher proportion of heart disease presence than those who did (1).
• Fbs (Fasting Blood Sugar): The distributions are similar across both categories.
• Restecg (Resting ECG): Type 1 has a higher proportion of heart disease presence.
• Sex: Females (1) exhibit a lower proportion of heart disease presence compared to males (0).
• Slope (Slope of the Peak Exercise ST Segment): Slope type 2 has a higher proportion of heart disease presence.
• Thal (Thallium Stress Test Result): The reversible defect category (2) has a higher proportion of heart disease presence than the other categories.

Summary:
• Higher Impact on Target: ca, cp, exang, sex, slope, and thal
• Moderate Impact on Target: restecg
• Lower Impact on Target: fbs
CORRELATION HEATMAP
• Positive correlations were observed between the target variable and “Cp”, “Thalach” and “Slope”.
• “Exang”, “Oldpeak”, “Ca” and “Thal” appear to be strongly negatively correlated with the target.
• “Age” and “Sex” exhibited moderate correlation with the target.
• “Trestbps”, “Chol”, “Fbs” and “Restecg” demonstrated minimal correlation with the target.
PREPROCESSING
• Irrelevant Feature Removal: All features in the dataset appear to be relevant based on EDA. We will retain all the columns, ensuring no valuable information is lost, especially given the dataset’s small size.
• Missing Value Treatment: No missing values were found in the dataset.
• Outlier Treatment: We checked the continuous features for outliers using the IQR method. Given the outliers found, the nature of the algorithms, and the small dataset size, directly removing outliers might not be the best approach. Instead, we apply a Box-Cox transformation to stabilize variance and bring the data closer to a normal distribution.
• Categorical Feature Encoding: We applied one-hot encoding to columns such as “Cp”, “Thal” and “Restecg”, since these are nominal variables.
• Feature Scaling: Scaling is important for algorithms that are sensitive to the magnitude and scale of features, but not all algorithms require it; Decision Trees, for example, are scale-invariant. Given our intent to use a mix of models, we have chosen to handle scaling later using pipelines.
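The outlier check and the transformation named in the preprocessing steps can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the cholesterol readings are made up, the 1.5 multiplier is the conventional IQR fence, and the Box-Cox lambda is fixed by hand here, whereas a library routine such as `scipy.stats.boxcox` would normally estimate it from the data.

```python
import math
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def box_cox(values, lam):
    """Apply the Box-Cox transform for a given lambda; inputs must be > 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The lambda = 0 case is defined as the natural logarithm.
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical cholesterol readings (mg/dl), for illustration only.
chol = [180, 195, 200, 210, 220, 230, 240, 250, 260, 564]

outliers = iqr_outliers(chol)
transformed = box_cox(chol, lam=0)

print("outliers:", outliers)
print("largest value after transform:", round(max(transformed), 3))
```

Note that Box-Cox is only defined for strictly positive inputs, so a feature containing zeros would need a shift or a different transform before this step.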
TRAIN TEST SPLIT
• We divided the data into training (80%) and testing (20%) sets.
• Setting a random state ensures consistent results, and using stratify=y maintains a proportional distribution of the target variable in both sets.

SPLITTING THE DATA INTO X & Y
• We divided the dataset into two parts: X and y.
• "X" represents the independent variables, and "y" represents the dependent (target) variable that we want to predict or understand.
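The 80/20 stratified split described above can be sketched with scikit-learn's `train_test_split`. The feature matrix here is synthetic random data standing in for the real X and y, and the `random_state` value of 42 is an arbitrary choice for illustration, not a value taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, roughly balanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 49 + [1] * 51)

# test_size=0.2 gives the 80/20 split; stratify=y keeps the class
# proportions similar in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train rows:", len(X_train), "test rows:", len(X_test))
print("positive share, train:", y_train.mean(), "test:", y_test.mean())
```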
MODEL SELECTION
Models used:
• Logistic Regression: Logistic Regression is commonly used for binary classification problems. It is preferred because it provides a simple and efficient way to model the relationship between the independent variables and the probability of a certain outcome.
• Decision Tree: Decision Tree algorithms are used for classification because they are simple, computationally efficient, and effective in handling high-dimensional data. They work best with categorical independent columns.
• Support Vector Machine: SVM is a powerful supervised algorithm that works well on smaller but complex datasets. It can be used for both regression and classification tasks, but it generally works best on classification problems.
• Random Forest: Random Forest is a robust supervised algorithm suitable for both regression and classification tasks.
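A minimal sketch of how the four models listed above could be trained and compared with scikit-learn follows. It is not the project's code: the data is generated with `make_classification`, the hyperparameters are library defaults or arbitrary choices, and the decision to wrap only Logistic Regression and SVM in a scaling pipeline is an assumption that follows the earlier remark that tree-based models are scale-invariant.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the heart dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale-sensitive models get a StandardScaler inside a pipeline, so the
# scaler is fitted on the training data only; tree models are left unscaled.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out set
    print(f"{name}: {scores[name]:.2f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which matters when models are compared on a small dataset.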
Model Comparison (values in %)

Model                     Accuracy  Precision  Recall  F1-Score
Support Vector Machine    99        100        98      99
Random Forest Classifier  88        85         93      89
Decision Tree             88        87         90      88
Logistic Regression       82        82         85      83

[Bar chart: Model Comparison. Accuracy, Precision, Recall and F1-Score for each model; y-axis: Percentage (%)]
Experimental Results:
SVM Dominates: Support Vector Machine excels with 99% accuracy, 100% precision and 98% recall, the best overall classification performance of the four models.
Random Forest & Decision Tree Consistency: Both algorithms reach 88% accuracy, with Decision Tree having higher precision (87% vs. 85%) and Random Forest higher recall (93% vs. 90%).
Logistic Regression: LR has the lowest scores of the four models, with 82% accuracy, 82% precision and 85% recall.
Understanding the Metrics
Recall: The ability of a model to find all the relevant cases.
Precision: The accuracy of the model when it claims to have found something.
F1 Score: A balance between recall and precision, useful when both false positives and false negatives need to be minimized.
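The definitions above can be made concrete with a small worked example that derives each metric from confusion-matrix counts. The counts below are hypothetical and chosen only for illustration; they are not the project's confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 90 true positives, 10 false positives,
# 5 false negatives, 95 true negatives.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screening setting such as heart disease prediction, a false negative (a missed patient) is usually the costlier error, which is why recall is reported alongside accuracy.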
Conclusion
Lack of Test Dataset Evaluation:
The model's performance on new, unseen data is not evaluated, raising concerns about its real-world
applicability.
Single and Small Size Dataset Limitation:
The study relies on a single dataset, potentially limiting its generalizability to diverse populations.
Limited Variable Consideration:
The analysis focuses narrowly on demographic and clinical variables, overlooking lifestyle and genetic factors
relevant to heart health.
Future research must ensure robustness, generalizability, and interpretability so that decisions based on the study findings are well informed. It is important to check how duplicate data and unusual values affect the model's accuracy, and to develop strategies for handling them in order to improve model performance.
The results indicated that the Support Vector Machine model had the highest accuracy, at 99%.
The study utilized the Kaggle Heart Failure Prediction dataset with 1025 instances, and all algorithms were implemented in Jupyter Notebook.
The accuracies of all algorithms were 82% or higher, with the lowest accuracy (82%) given by Logistic Regression and the highest given by Support Vector Machine, as previously mentioned.
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 

Heart Disease Prediction Analysis - Sushil Gupta.pptx

  • 1.
  • 2. Heart Disease Prediction Analysis. “In the realm of healthcare advancement, this project focusses on the strategic refinement of prediction algorithms using machine learning techniques. By sharpening our focus on heart disease prediction, we aim to pioneer advancements that will redefine early intervention strategies and contribute to the overall improvement of patient health outcomes.” Presented by: Sushil Gupta
  • 3. Contents
      • 01 Introduction: Reasons / issues causing heart diseases; models and early detection bring essential benefits
      • 02 Data Gathering / Data Refinement: Comprehensive dataset overview; refinement and data preparation steps
      • 03 Exploratory Data Analysis: Find patterns and identify trends; gain valuable insights from the dataset
      • 04 Feature Extraction: Divide features into X and Y
      • 05 Machine Learning / Deep Learning Methodology: Algorithms or models used; compare and evaluate each algorithm
      • 06 Experimental Results: Visuals of compared models; accuracy and confusion matrix of models
      • 07 Conclusion: Importance of the used approach; suggestions for the future
      • 08 Appendices and References: Links and code used in the project
  • 4. INTRODUCTION
      • Cardiovascular diseases (CVDs) are a group of disorders affecting the heart and blood vessels, and heart failure is a common event caused by CVDs. Early detection and data science models in medicine can help doctors foresee health issues before they become serious.
      • This dataset covers patient health indicators in a medical context, focusing specifically on factors related to heart disease.
      • The dataset comprises 13 features, such as age, blood pressure, cholesterol, fasting blood sugar, chest pain and more.
      • The target variable, 'Heart Disease Presence', indicates whether an individual has heart disease.
      • Key figures: 17.9m fatalities caused by CVD each year; 33% of global deaths.
      • Reasons / issues causing heart disease: Hypertension, Smoking, Obesity, Sedentary lifestyle, Hyperlipidaemia, Alcohol.
  • 5. Value of the Study / Objective of the Study: Early Detection and Prevention; Real-time Monitoring; Improved Patient Outcomes; Public Health Advancement; Comparing Algorithms; Risk Stratification
  • 6. WORK FLOW
      • STEP 01 Data Collection: Gathering and organizing data to train machine learning models.
      • STEP 02 Exploratory Data Analysis: Analyzing and visualizing data patterns to understand its characteristics.
      • STEP 03 Preprocessing: Preparing and cleaning data to enhance its quality and suitability.
      • STEP 04 Split Train and Test Data: Dividing the dataset into training and testing sets for evaluation.
      • STEP 05 Model Selection and Model Training: Choosing a suitable machine learning algorithm and optimizing its parameters.
      • STEP 06 Model Evaluation: Assessing the performance of a machine learning model using metrics.
      • STEP 07 Monitor and Update: Continuously monitoring the model's performance in real-world scenarios.
  • 7. Dataset overview (Heart_1)
      • Columns (features): 13. Rows: 1025.
      • X, demographic information: Age, Sex.
      • X, clinical data: Chest Pain Type, Cholesterol, Fasting BS, Resting ECG, Max HR, Exercise Angina, Oldpeak, ST Slope.
      • Y, outcome variable: Heart Disease (1: Heart Disease, 0: Normal).
  • 8. EXPLORATORY DATA ANALYSIS
      • Exploratory Data Analysis (EDA) helped us understand the data structure, find patterns, identify trends, and gain valuable insights from the dataset.
      • During EDA we analyzed the distribution of each feature and checked the correlations between features.
      • Visualizations helped us see the data clearly, understand how the clinical variables relate to each other, and pinpoint the factors that play a vital role in predicting heart disease.
      • The dataset is clean in that it has no null values, but it does contain some duplicate rows and outliers.
      • Nine columns are numerical by data type but categorical in meaning, so we converted them to object type for proper analysis.
      • We used univariate and bivariate analysis to gain insight into the individual characteristics of the data and into how each feature relates to the main goal: predicting the target variable.
      • Chart: HEART PATIENT (Heart Presence / Not Present): 51% / 49%.
      • Chart: AGE GROUP VS TARGET (Young Age / Middle Age / Old Age): 6% / 64% / 30%.
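The data-quality checks described on this slide (null values, duplicate rows, and recasting numerically coded categorical columns) can be sketched with pandas roughly as follows. This is a minimal illustration on a tiny made-up frame: the values are invented, and the list of columns treated as categorical is an assumption, not the author's actual list of nine.

```python
import pandas as pd

# Tiny made-up sample; values are illustrative only, not taken from the real dataset.
df = pd.DataFrame({
    "age": [52, 53, 70, 52],
    "sex": [1, 1, 1, 1],
    "cp": [0, 0, 0, 0],
    "chol": [212, 203, 174, 212],
    "target": [0, 0, 0, 0],
})

null_counts = df.isnull().sum()             # missing values per column
n_duplicates = int(df.duplicated().sum())   # rows that repeat an earlier row exactly

# Assumed subset of columns that are numeric by dtype but categorical in meaning.
categorical_cols = ["sex", "cp", "target"]
df = df.astype({col: "object" for col in categorical_cols})

print(null_counts.to_dict())
print("duplicate rows:", n_duplicates)
print(df.dtypes.to_dict())
```

Casting to object mainly changes how summary and plotting helpers treat the column; the underlying codes stay the same.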
  • 9. DISTRIBUTION OF CONTINUOUS VARIABLES
      • Age: Uniform distribution with a peak in the late 50s; the mean age is approx. 54.43 years.
      • Resting Blood Pressure: Concentrated around 120-140 mm Hg; the mean is approx. 131.61 mm Hg.
      • Cholesterol: Average between 200 and 280 mg/dl.
      • Thalach (Maximum Heart Rate): The majority of individuals achieve a heart rate between 140 and 170 bpm during a stress test.
      • Oldpeak (ST Depression Induced by Exercise): Most values are concentrated towards 0, indicating that many individuals did not experience significant ST depression.
  • 10. DISTRIBUTION OF CATEGORICAL VARIABLES
      • Cp (Chest Pain): Typical Angina appears to be the most prevalent type.
      • Fbs (Fasting Blood Sugar): The majority of patients have an Fbs below 120 mg/dl, indicating that high Fbs is not a common condition in this dataset.
      • Exang (Exercise-Induced Angina): The majority of patients do not experience exang, suggesting it may not be a common symptom among the patients in this dataset.
      • Slope (Slope of the Peak Exercise ST Segment): Types 1 (Flat) and 2 (Downsloping) are more common.
      • Thal (Thallium Stress Test Result): Reversible Defect (type 2) appears to be more prevalent than the other types.
      • Ca (No. of major vessels colored by fluoroscopy): Most patients have few major vessels, with 0 being the most frequent value.
  • 11. CONTINUOUS FEATURE vs TARGET
      • Age: Patients with heart disease are a bit younger on average than those without.
      • Trestbps (Resting BP): Nearly identical for both groups, indicating limited differentiating power for this feature.
      • Chol (Serum Cholesterol): The distributions for both categories are quite close, but the mean for patients with heart disease is slightly lower.
      • Thalach (Max. Heart Rate Achieved): Noticeable difference in distributions. Patients with heart disease tend to achieve a higher maximum heart rate.
      • ST Depression (Oldpeak): Lower for patients with heart disease, whose distribution sits near 0, whereas the non-disease group has a wider spread.
      • Summary: Thalach appears to have the greatest impact, followed by ST Depression (Oldpeak) and Age.
  • 12. CATEGORICAL FEATURE vs TARGET
      • Ca (No. of major vessels): The proportion of heart disease falls as Ca rises, except for the last category; Ca = 0 has a higher proportion of heart disease.
      • Cp (Chest Pain): Types 1, 2 and 3 have a higher proportion of heart disease than type 0.
      • Exang (Exercise-Induced Angina): Patients who did not experience exang (0) show a higher proportion of heart disease presence than those who did (1).
      • Fbs (Fasting Blood Sugar): The distributions are similar for both categories.
      • Restecg (Resting ECG): Type 1 has a higher proportion of heart disease presence.
      • Sex: Females (1) exhibit a lower proportion of heart disease presence compared to males (0).
      • Slope (Slope of the Peak Exercise ST Segment): Slope type 2 has a higher proportion of heart disease presence.
      • Thal (Thallium Stress Test Result): The Reversible Defect category (2) has a higher proportion of heart disease presence than the other categories.
      • Summary: Higher impact on target: ca, cp, exang, sex, slope, and thal. Moderate impact on target: restecg. Lower impact on target: fbs.
  • 13. CORRELATION HEATMAP
      • Positive correlations were observed between the target variable and “Cp”, “Thalach” and “Slope”.
      • “Exang”, “Oldpeak”, “Ca” and “Thal” appear to be strongly negatively correlated with the target.
      • “Age” and “Sex” exhibited moderate correlation with the target.
      • “Trestbps”, “Chol”, “Fbs” and “Restecg” showed minimal correlation with the target.
  • 14. PREPROCESSING
      • Irrelevant Feature Removal: All features in the dataset appear to be relevant based on the EDA. We retain all the columns so that no valuable information is lost, especially given the dataset’s small size.
      • Missing Value Treatment: No missing values were found in the dataset.
      • Outlier Treatment: We checked the continuous features for outliers using the IQR method. Given the outliers identified, the nature of the algorithms, and the small dataset size, removing outliers directly may not be the best approach. Instead, we apply a Box-Cox transformation to stabilize variance and bring the data closer to a normal distribution.
      • Categorical Feature Encoding: We applied one-hot encoding to columns such as “Cp”, “Thal” and “Restecg”, since these are nominal variables.
      • Feature Scaling: Scaling is important for algorithms that are sensitive to the magnitude and scale of features, but not all algorithms require it; Decision Trees, for example, are scale-invariant. Since we intend to use a mix of models, we handle scaling later using pipelines.
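The outlier step on this slide can be sketched roughly as below, assuming NumPy and SciPy are available. The cholesterol-like values are made up for illustration, and the 1.5 × IQR fence is the conventional choice rather than a threshold stated in the slides. Note that `scipy.stats.boxcox` requires strictly positive input, so a feature containing zeros (such as Oldpeak) would need a positive shift first; how the author handled that is not stated.

```python
import numpy as np
from scipy import stats

# Made-up cholesterol-like readings with one extreme value (illustrative only).
chol = np.array([212.0, 203.0, 174.0, 294.0, 248.0, 318.0, 289.0, 249.0, 564.0])

# IQR rule: flag points lying more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(chol, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = chol[(chol < lower) | (chol > upper)]

# Box-Cox transform; lambda is estimated by maximum likelihood when not supplied.
transformed, lam = stats.boxcox(chol)

print("IQR fences:", lower, upper)
print("outliers:", outliers.tolist())
print("estimated lambda:", round(float(lam), 3))
```

The transform keeps every row and only compresses the long tail, which matches the slide's reasoning for not dropping outliers from a small dataset.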
  • 15. TRAIN TEST SPLIT
      • We divided the data into training (80%) and testing (20%) sets.
      • Setting a random state ensures reproducible results, and using stratify=y maintains a proportional distribution of the target variable in both sets.
    SPLITTING THE DATA INTO X & Y
      • We divided the dataset into two parts: X and y.
      • "X" represents the independent variables, and "y" represents the dependent (target) variable that we want to predict or understand.
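A stratified 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. The arrays below are synthetic placeholders, and the seed value `0` is an arbitrary assumption; the slides do not say which random state was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 3 features, balanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 50 + [0] * 50)

# 80/20 split; stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print("positive share (train, test):", y_train.mean(), y_test.mean())
```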
  • 16. MODEL SELECTION
    Models used:
      • Logistic Regression: Logistic Regression is commonly used for binary classification problems. It is preferred because it provides a simple and efficient way to model the relationship between the independent variables and the probability of a certain outcome.
      • Decision Tree: Decision Tree algorithms are used for classification because they are simple, computationally efficient, and effective in handling high-dimensional data. They work best with categorical independent columns.
      • Support Vector Machine: SVM is a powerful supervised algorithm that works best on smaller but complex datasets. It can be used for both regression and classification tasks, but generally works best on classification problems.
      • Random Forest: Random Forest is a robust supervised algorithm suitable for both regression and classification tasks.
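The four models can be trained and compared in one loop. The sketch below uses scikit-learn with synthetic data from `make_classification` as a stand-in for the heart dataset, so the printed accuracies will not match the results reported in the slides. Wrapping the scale-sensitive models in a pipeline with `StandardScaler` is one plausible reading of "handle scaling later using pipelines"; the hyperparameters shown are assumptions, not the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart dataset (13 features, binary target).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale-sensitive models get a scaler in front; tree-based models do not need one.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)                      # fit on the training split only
    accuracies[name] = model.score(X_test, y_test)   # accuracy on the held-out split
    print(f"{name}: {accuracies[name]:.2f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training split only, which avoids leaking test-set statistics into training.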
  • 17. EXPERIMENTAL RESULTS
    Model Comparison (percentage, %):
      Model                       Accuracy   Precision   Recall   F1-Score
      Support Vector Machine      99         100         98       99
      Random Forest Classifier    88         85          93       89
      Decision Tree               88         87          90       88
      Logistic Regression         82         82          85       83
      • SVM Dominates: Support Vector Machine excels with 99% accuracy, balanced precision (100%) and recall (98%), showcasing superior overall classification performance.
      • Random Forest & Decision Tree Consistency: Both algorithms maintain 88% accuracy, with Decision Tree having higher precision (87%) and Random Forest higher recall (93%).
      • Logistic Regression: LR achieves 82% accuracy and high precision (82%) but lower recall (80%).
    Understanding the metrics:
      • Recall: The ability of a model to find all the relevant cases.
      • Precision: The accuracy of the model when it claims to have found something.
      • F1 Score: A balance between recall and precision, useful when both false positives and false negatives need to be minimized.
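The four metrics compared on this slide follow directly from confusion-matrix counts. The helper below is a plain-Python sketch; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 20-sample test set (not the project's actual results).
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In a screening setting, recall is often weighted more heavily because a false negative means a patient with heart disease is missed.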
  • 18. CONCLUSION
      • Lack of Test Dataset Evaluation: The model's performance on new, unseen data is not evaluated, raising concerns about its real-world applicability.
      • Single and Small Dataset Limitation: The study relies on a single dataset, potentially limiting its generalizability to diverse populations.
      • Limited Variable Consideration: The analysis focuses narrowly on demographic and clinical variables, overlooking lifestyle and genetic factors relevant to heart health.
      • Future research must ensure robustness, generalizability, and interpretability so that decisions based on the study findings are well informed.
      • It is important to check how duplicate data and unusual values affect the model's accuracy. Developing strategies to deal with these issues is valuable for improving model performance.
      • The results indicated that the Support Vector Machine model had the highest accuracy, at 99%.
      • The study used the Kaggle Heart Failure Prediction dataset with 1025 instances, and all algorithms were implemented in Jupyter Notebook.
      • The accuracies of all algorithms were 83% or above, with the lowest accuracy of 83% given by Logistic Regression and the highest given by Support Vector Machine, as previously mentioned.