UCI Heart Disease Prediction Data
By Supriya Kamble
Introduction
Cardiovascular diseases have been the most common cause of death worldwide over the last few decades, in
developed as well as underdeveloped and developing countries. Early detection of cardiac diseases and
continuous supervision by clinicians can reduce the mortality rate. However, it is not possible to monitor
every patient accurately every day, and round-the-clock consultation with a doctor is not available,
since it requires more time and expertise than is usually on hand.
Every day, the average human heart beats around 100,000 times, pumping 2,000 gallons of blood through the
body. Inside your body, there are 60,000 miles of blood vessels. It's a lot of work for an organ that is
about the size of a large fist and weighs between 8 and 12 ounces.
The signs of a heart attack in women are much less noticeable than those in men. In women, a heart attack
may feel like uncomfortable squeezing, pressure, fullness, or pain in the center of the chest. It may also
cause pain in one or both arms, the back, neck, jaw, or stomach, as well as shortness of breath, nausea,
and other symptoms.
Men experience typical symptoms of a heart attack, such as chest pain, discomfort, and stress. They may also
experience pain in other areas, such as the arms, neck, back, and jaw, along with shortness of breath,
sweating, and discomfort that mimics heartburn.
Objective of Data
The objective of the UCI Heart Disease dataset is to facilitate research and analysis aimed at developing
predictive models for the detection and assessment of heart disease. Specifically, the dataset aims to:
• Enable Prediction: Provide a diverse set of medical attributes and corresponding diagnoses to enable
the development of machine learning models capable of predicting the likelihood of heart disease in
patients.
• Support Research: Serve as a valuable resource for researchers and data scientists interested in
studying the factors associated with heart disease and exploring novel approaches to its diagnosis and
treatment.
• Promote Healthcare Innovation: Foster innovation in healthcare by empowering healthcare providers,
businesses, and policymakers with data-driven insights into heart disease risk assessment and
management.
• Improve Patient Outcomes: Ultimately, the primary objective of the dataset is to contribute to the
improvement of patient outcomes by facilitating early detection, intervention, and personalized
treatment of heart disease.
How data can help businesses
1) Healthcare Providers: Hospitals and clinics can use these models to assess the risk of heart disease in
patients during routine check-ups. This can lead to early detection and intervention, ultimately
improving patient outcomes and reducing healthcare costs.
2) Insurance Companies: Insurance companies can utilize these models to assess the risk of heart
disease in their policyholders. By identifying high-risk individuals, they can offer targeted
interventions or wellness programs to mitigate the risk and reduce claims.
3) Pharmaceutical Companies: Pharmaceutical companies can use predictive models to identify
potential candidates for clinical trials of new drugs aimed at preventing or treating heart disease. This
can streamline the drug development process and bring new treatments to market more efficiently.
4) Healthtech Startups: Startups focused on digital health and wellness can develop applications or
wearable devices that utilize heart disease prediction models to provide personalized health
recommendations to users. This can empower individuals to take proactive steps toward preventing
heart disease.
Real-life Applications
1) Clinical Decision Support: Healthcare professionals can use these models as decision-support tools
during patient consultations. By inputting patient data into the model, clinicians can obtain risk scores
and recommendations for further evaluation or treatment.
2) Public Health Initiatives: Public health authorities can utilize predictive models to identify
populations at high risk of heart disease and implement targeted prevention strategies, such as
educational campaigns, screening programs, or policy interventions.
3) Remote Monitoring: Remote monitoring devices equipped with heart disease prediction algorithms
can continuously monitor individuals at risk and alert them or their caregivers of any significant
changes or warning signs, enabling timely medical intervention.
4) Personalized Medicine: Predictive models can facilitate the shift towards personalized medicine by
enabling healthcare providers to tailor treatment plans based on an individual's risk profile and
genetic predisposition to heart disease.
About Dataset
• This is a multivariate dataset, meaning it involves several statistical variables and calls for
multivariate numerical data analysis.
• It is composed of 14 attributes: age, sex, chest pain type, resting blood pressure, serum
cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved,
exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the
peak exercise ST segment, number of major vessels, Thalassemia, and the predicted attribute (num).
• The full database includes 76 attributes, but all published studies use a subset of 14 of them.
One major task with this dataset is to predict, from the given attributes of a patient, whether
that person has heart disease. The other is the exploratory task of drawing
insights from the dataset that could help in understanding the problem better.
Column Descriptions
1) id: Unique identifier for each patient
2) age: Age of the patient in years
3) origin: Place of study
4) sex: Gender of the patient
5) cp: Chest pain type (e.g., typical angina, atypical angina, non-anginal, asymptomatic)
6) trestbps: Resting blood pressure (mm Hg on admission)
7) chol: Serum cholesterol level (mg/dl)
8) fbs: Whether fasting blood sugar is > 120 mg/dl
9) restecg: Resting electrocardiographic results (values: normal, ST-T abnormality, left ventricular hypertrophy)
10) thalch: Maximum heart rate achieved
11) exang: Exercise-induced angina (True/False)
12) oldpeak: ST depression induced by exercise relative to rest
13) slope: Slope of the peak exercise ST segment
14) ca: Number of major vessels colored by fluoroscopy (0-3)
15) thal: Thalassemia diagnosis (normal, fixed defect, reversible defect)
16) num: Predicted attribute indicating presence of heart disease (0 = no disease, 1-4 = disease categories)
Challenges
1) Data Quality: Ensuring the accuracy and reliability of the medical data is crucial for building effective
prediction models. Incomplete or inaccurate data can lead to biased or unreliable predictions.
2) Feature Selection: Identifying the most relevant features or attributes from the dataset that
contribute to the prediction of heart disease is essential. This requires domain knowledge and careful
analysis of the data.
3) Imbalanced Data: Imbalance in the distribution of classes (i.e., presence or absence of heart disease)
can affect the performance of machine learning algorithms. Techniques such as oversampling, under-
sampling, or using algorithms that handle imbalanced data well are necessary to address this issue.
4) Interpretability: Building models that not only provide accurate predictions but also offer insights into
the factors contributing to the prediction is important for gaining trust from healthcare professionals
and patients.
Data Understanding
Begin by loading the dataset into Python. Verify that the dataset is loaded correctly and
examine the first few rows to get a glimpse of the data structure.
The dataset has 920 rows and 16 attributes, of which num is the dependent variable that we
have to predict.
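The loading step can be sketched with pandas as follows. This is a minimal illustration, not the code from the slides: the inline three-row sample, the name `SAMPLE_CSV`, and the helper `load_dataset` are made up here so the snippet runs on its own. With the real file, its path would be passed to `load_dataset` instead.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
SAMPLE_CSV = """id,age,origin,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
1,63,Cleveland,Male,typical angina,145,233,True,lv hypertrophy,150,False,2.3,downsloping,0,fixed defect,0
2,67,Cleveland,Male,asymptomatic,160,286,False,lv hypertrophy,108,True,1.5,flat,3,normal,2
3,41,Cleveland,Female,atypical angina,130,204,False,lv hypertrophy,172,False,1.4,upsloping,0,normal,0
"""

def load_dataset(source):
    """Read the CSV from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)

df = load_dataset(io.StringIO(SAMPLE_CSV))
print(df.shape)   # (rows, columns) of whatever was loaded
print(df.head())  # first few rows, to check the columns parsed as expected
```

On the full dataset, `df.shape` is where the 920 rows and 16 attributes reported above would show up.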
Dataset Overview
Based on the summary above, it appears that the data
consists of a total of 920 observations. However, many
features in this dataset have missing values, including
trestbps, chol, fbs, restecg, thalch, exang, oldpeak, slope, ca,
and thal. In addition, the dataset contains both numeric and
categorical variables.
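A minimal sketch of how such an overview could be produced with pandas, assuming the data is already in a DataFrame. The toy values below are invented; only the column names come from the slides.

```python
import numpy as np
import pandas as pd

# Made-up toy frame; only the column names come from the slides.
df = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "chol": [233.0, np.nan, 204.0, np.nan],
    "cp": ["typical angina", "asymptomatic", None, "non-anginal"],
    "thal": [None, "normal", None, None],
})

# Number of missing values per column, largest first.
missing_counts = df.isna().sum().sort_values(ascending=False)

# Split the columns by type so each group can be explored separately.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

The two column lists are a convenient starting point for the count plots (categorical) and histograms (numerical) in the exploratory analysis.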
Exploring Numerical and Categorical Features
Exploratory Data Analysis (EDA)
Categorical Features – Countplot
Numerical Features – histplot
Outlier Detection
Based on the box plot above, trestbps, chol, and thalch exhibit outliers, especially chol. In contrast, age and
exang do not have outliers.
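Box plots conventionally flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule directly, assuming this convention is the one behind the plot. The helper name `iqr_outlier_mask` and the cholesterol values are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up cholesterol readings with one extreme value.
chol = pd.Series([200, 210, 220, 230, 240, 250, 600], name="chol")
mask = iqr_outlier_mask(chol)
print(chol[mask].tolist())  # [600]
```

Counting `mask.sum()` per numeric column gives a quick numeric companion to the visual check.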
Pattern of Missingness
• Based on the heatmap above, missing
values become frequent from around
the 300th row onward.
• The top three variables with the
highest number of observations with
missing values are slope, ca, and thal.
• So far, it does not look like the
missing values are distributed
randomly.
Correlation Matrix
• From the heatmap above, we
observe a strong relationship of
missing values between thalch
and trestbps, exang and
trestbps, oldpeak and trestbps,
etc.
• Once again, the pattern of
missing values among variables
does not appear random.
• As we mentioned above, the
dataset includes 15 variables.
However, at least 10 variables
have missing values.
• Hence, we will apply 2
imputation methods
(Median/Mode imputation and
Random Forest imputation) to
fill in the missing values.
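One way to quantify how strongly missing values co-occur is to correlate the missingness indicators of the columns. This is a sketch under the assumption that the heatmap was built along these lines; the toy values are invented and only the column names come from the slides.

```python
import numpy as np
import pandas as pd

# Made-up toy frame in which thalch and exang are always missing together.
df = pd.DataFrame({
    "trestbps": [130, 120, np.nan, 140, np.nan, 125],
    "thalch":   [150, np.nan, np.nan, 160, 170, 155],
    "exang":    [0, np.nan, np.nan, 1, 0, 1],
    "age":      [63, 67, 41, 56, 52, 60],
})

# 1 where a value is missing, 0 otherwise.
missing_indicator = df.isna().astype(int)

# Keep only columns that have some (but not all) values missing, then
# correlate the indicators: values near 1 mean the columns tend to be
# missing in the same rows.
partly_missing = missing_indicator.loc[:, missing_indicator.nunique() > 1]
missing_corr = partly_missing.corr()
print(missing_corr.round(2))
```

High off-diagonal values in `missing_corr` are the kind of evidence that missingness is not random.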
Imputing Missing Values
Median/Mode Imputation
We will start with the simplest imputation method, median/mode imputation, to fill in the missing values.
For numerical features, we fill in the missing values with the median of the feature. For categorical
features, we replace the missing values with the mode.
Numeric variables ==> median value
Categorical variables ==> mode value
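The median/mode rule can be sketched with pandas as follows. The function name `impute_median_mode` and the toy data are illustrative assumptions, not the code used in the slides.

```python
import numpy as np
import pandas as pd

def impute_median_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    filled = df.copy()
    for col in filled.columns:
        if not filled[col].isna().any():
            continue  # nothing to impute
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() can return several values on ties; take the first.
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

# Made-up toy data; only the column names come from the slides.
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0, 260.0],
    "slope": ["flat", "flat", None, "upsloping"],
})
imputed = impute_median_mode(toy)
print(imputed)
```

The function works on a copy, so the original frame stays available for trying a second imputation method on the same data.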
Bivariate Analysis
Distribution of Age Among Patients with and without Heart Disease
We can notice that people between the ages of 40 and 70 are the most affected by heart disease.
Heart Disease Prevalence by Sex
We can notice that men are more susceptible to heart disease at all levels.
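A comparison like this can also be tabulated rather than plotted. The sketch below uses a row-normalised cross-tabulation; the rows are made up and do not reproduce the dataset's actual proportions.

```python
import pandas as pd

# Made-up toy rows; `num` is the 0-4 target described in the slides.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "num": [0, 2, 1, 0, 0, 3],
})

# Row-normalised table: share of each disease category within each sex.
prevalence = pd.crosstab(df["sex"], df["num"], normalize="index")
print(prevalence.round(2))
```

Normalising by row makes the two groups comparable even when one sex is much more common in the data.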
Relationship Between Cholesterol Levels and Heart Disease
• The box plot illustrates cholesterol
levels across five heart disease
categories, showing median
values, range variability, and
outliers.
• Categories 1 to 4 have similar
medians, but the spread and
outliers differ, with category 0
showing the most variability.
Maximum Heart Rate and Heart Disease
The plot shows a negative
correlation where the maximum
heart rate tends to decrease as
age increases.
The Impact of Exercise-Induced Angina on Heart Disease
• Most cases in category 0 do not
report angina, while categories 1
through 4 show a more varied
distribution, with both angina
and non-angina cases present.
• The data suggests that exercise-
induced angina is more
commonly reported in individuals
with heart disease categories 1
to 4 compared to category 0.
Average Resting Blood Pressure by Heart Disease Status
• All categories show similar
average blood pressures ranging
slightly above 120 mm Hg.
• The error bars indicate some
variability in the measurements,
with a slight trend toward
increasing variability from status
0 to 4.
Distribution of Chest Pain Type among Patients
• 'Asymptomatic' is the most common
type of chest pain across all heart
disease statuses except for status 0,
where 'typical angina' is more
prevalent.
• 'Non-anginal' pain is notably
frequent in heart disease status 4,
while 'atypical angina' is relatively
less common across all statuses.
Fasting Blood Sugar and Heart Disease
• The majority of individuals across all
heart disease statuses have fasting
blood sugar levels at or below 120
mg/dl.
• For those with higher blood sugar
levels, the counts are notably lower,
suggesting that elevated fasting blood
sugar is less common among these
individuals regardless of their heart
disease status.
Heart Disease Prevalence by Resting Electrocardiographic Results
• Most individuals with a normal
ECG result fall into the '0' heart
disease category, indicating no
presence of heart disease.
• In contrast, those with ST-T
abnormalities show a higher
count of heart disease statuses 1
through 4.
• Left ventricular hypertrophy is
less common but shows some
presence across all heart disease
categories.
Data Preprocessing
Looking at the data, we see that some of the features have categorical values, so we have to one-hot
encode them. Also, the original dataset contains the target as 0, 1, 2, 3, 4. Since we only want to identify the
presence of disease, we treat the problem as binary classification. With that in mind, we convert the target
values in the num column into 1/0: 0 stays 0 (no disease) and 1 through 4 become 1 (disease present).
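Both preprocessing steps can be sketched in a few lines of pandas. The toy rows and the new column name `target` are assumptions made for illustration.

```python
import pandas as pd

# Made-up toy rows; column names follow the slides.
df = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "cp": ["typical angina", "asymptomatic", "non-anginal", "asymptomatic"],
    "num": [0, 2, 0, 4],
})

# Binary target: 0 stays 0 (no disease), categories 1-4 become 1.
df["target"] = (df["num"] > 0).astype(int)

# One-hot encode the categorical feature; numeric columns pass through.
features = pd.get_dummies(df.drop(columns=["num", "target"]), columns=["cp"])
print(df["target"].tolist())
print(features.columns.tolist())
```

Each distinct chest pain type becomes its own indicator column, so the models receive only numeric inputs.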
One-Hot Encoding
Splitting the Dependent and Independent Features
We separate the dependent and independent features and split the data using train_test_split from the
sklearn library. The split uses an 80-20 train-test ratio.
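A minimal sketch of the 80-20 split with scikit-learn. The arrays are made up, and `random_state` and `stratify` are assumptions added here for reproducibility and class balance; the slides do not say whether they were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up feature matrix (10 patients, 3 features) and binary target.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 gives the 80-20 split; random_state and stratify are
# assumptions added here, not settings taken from the slides.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```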
Feature Scaling
• Normalization
Min-max normalization is used to normalize the data. This method rescales each feature to the range [0, 1].
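Min-max scaling maps each value x to (x - min) / (max - min). The sketch below uses scikit-learn's `MinMaxScaler` and fits it on the training data only, which is a common practice assumed here rather than something the slides state. The numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test values for two features (e.g. age, chol).
X_train = np.array([[29.0, 200.0], [54.0, 250.0], [77.0, 300.0]])
X_test = np.array([[41.0, 225.0]])

# Fit on the training data only, then apply the same scaling to the test
# data: x_scaled = (x - min) / (max - min), with min/max from training.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)
print(X_test_scaled)
```

With this choice the training features span exactly [0, 1], while test values can fall slightly outside that range if they exceed the training minimum or maximum.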
Machine Learning Model
Logistic Regression
In the above figure, the red dots represent the predicted values, which are either 0 or 1, and the blue line and
dots represent the actual value for each patient. Where the red dot and blue dot do not overlap, the
prediction is wrong; where both dots overlap, the prediction is correct.
Model Evaluation
• The logistic regression has given an accuracy of 77.71%.
• From the confusion matrix, we can say the model can classify whether the disease is present or not. But
the False Positives and False Negatives are also high; to reduce them, we will fit another classification model.
A ROC curve, or receiver operating characteristic curve, is a graph that shows how well a classification
model performs by plotting the true positive rate against the false positive rate at different thresholds.
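The evaluation quantities mentioned here (accuracy, confusion matrix, and the area under the ROC curve) can be computed with scikit-learn as sketched below. The labels and probabilities are made up and are not the model outputs from the slides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Made-up labels, hard predictions, and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

accuracy = accuracy_score(y_true, y_pred)

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# AUC summarises the ROC curve in one number (1.0 = perfect ranking).
auc = roc_auc_score(y_true, y_prob)
print(accuracy, (tn, fp, fn, tp), auc)
```

In a medical setting the false negatives (`fn`) deserve particular attention, since they correspond to patients with disease that the model misses.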
Coefficients
Logistic Regression computes a weighted sum of the different features and passes it through
the sigmoid function to obtain the predicted probability; the coefficients are the weights in that sum.
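The weighted-sum view of the coefficients can be checked numerically. In this sketch, on made-up data, the probability is rebuilt by hand from `coef_` and `intercept_` of a fitted scikit-learn `LogisticRegression` and compared with `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, already-scaled data: 2 features, binary target.
rng = np.random.default_rng(0)
X = rng.random((40, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.75).astype(int)

model = LogisticRegression().fit(X, y)

# Weighted sum of the features plus the intercept (the log-odds)...
z = X @ model.coef_.ravel() + model.intercept_[0]
# ...passed through the sigmoid to obtain P(class = 1).
manual_prob = 1.0 / (1.0 + np.exp(-z))

print(model.coef_.ravel())
print(np.allclose(manual_prob, model.predict_proba(X)[:, 1]))
```

A positive coefficient raises the predicted probability as the feature increases and a negative one lowers it, which is what makes a coefficient plot readable.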
Random Forest Classifier
Random Forest has given an accuracy of 79.34%, which is better than Logistic Regression. The precision,
recall, and F1 scores also improved over the previous model.
Naïve Bayes
Naïve Bayes has given an accuracy of 77.7% which is the same as Logistic Regression. Also, the precision,
recall, and F1 scores have improved in this model.
Gradient Boosting Classifier
Gradient Boosting has performed better than all models so far, with an accuracy of 80.43%. Also, the
model can classify whether the disease is present or not more accurately.
XGBoost Classifier
After applying the XGBoost classifier, the True Positive and True Negative counts in the confusion matrix
have increased compared with the previous model.
LightGBM
Here, the accuracy increased to 81.52%, and the false
negatives and false positives decreased, making the model
better able to classify correctly.
Hyperparameter Tuning
Hyperparameters are external configurations that guide the learning process but are not learned from the
data. Hyperparameter tuning is the systematic optimization of these settings to enhance a model's performance. This
process often employs techniques like grid search, which explores different combinations of hyperparameter values
to find the set that maximizes model accuracy or another performance metric.
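Grid search can be sketched with scikit-learn's `GridSearchCV`. To keep the example dependency-free it tunes scikit-learn's own `GradientBoostingClassifier` on synthetic data as a stand-in for the boosted models tuned in the slides. The parameter grid is hypothetical, since the slides do not list the values that were tried.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real features would come from preprocessing.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search grid; the slides do not list the values tried.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern should carry over to the scikit-learn-style wrappers that XGBoost and LightGBM provide, with a grid adapted to their parameter names.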
The accuracy of XGBoost didn't
improve after hyperparameter
tuning.
The accuracy of LightGBM also didn't improve.
Model Selection
• The accuracy of both XGBoost and LightGBM didn't increase after hyperparameter tuning.
However, LightGBM has a high accuracy of 82%, and the model was able to classify the classes correctly.
Therefore, LightGBM is the best model for the heart disease prediction data.
• As per the result, the model has a precision score of around 82%, which is quite acceptable for predicting heart
disease in an individual based on the characteristics age, sex, cp, trestbps, chol, fbs, restecg,
thalch, exang, oldpeak, slope, ca, and thal.
Summary
1) The patients' ages range from 29 to 77 years, with an average age of 54.
2) The majority of the patients are male (75.9%) and the most common type of chest pain experienced by
the patients is typical angina (39.6%).
3) The average resting blood pressure is 131.6 mmHg and the average cholesterol level is 246 mg/dL.
4) The average maximum heart rate achieved during exercise is 139.9 bpm.
5) Most patients (70.3%) do not experience exercise-induced angina.
6) The average ST depression induced by exercise is 1.04 mm, and the majority of the patients (54.8%) have a
normal ECG result.
7) Several classification models were trained and evaluated, including Logistic Regression, Random Forest,
Naive Bayes, Gradient Boosting, XGBoost, and LightGBM.
8) The LightGBM model achieved the highest accuracy of 80.97% after hyperparameter tuning.
9) The ROC curves and AUC scores for each model were analyzed to assess their performance.
10) The results suggest that the XGBoost and LightGBM models are suitable for predicting the presence or
absence of heart disease based on the available features.
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data Science.pdf

More Related Content

Similar to NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data Science.pdf

Prediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxPrediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptx
kumari36
 
Heart disease prediction by using novel optimization algorithm_ A supervised ...
Heart disease prediction by using novel optimization algorithm_ A supervised ...Heart disease prediction by using novel optimization algorithm_ A supervised ...
Heart disease prediction by using novel optimization algorithm_ A supervised ...
BASMAJUMAASALEHALMOH
 
Running Head SCENARIO NCLEX MEMORIAL HOSPITAL .docx
Running Head SCENARIO NCLEX MEMORIAL HOSPITAL                    .docxRunning Head SCENARIO NCLEX MEMORIAL HOSPITAL                    .docx
Running Head SCENARIO NCLEX MEMORIAL HOSPITAL .docx
toltonkendal
 
HEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNING
HEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNINGHEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNING
HEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNING
IJDKP
 
Running head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docx
Running head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docxRunning head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docx
Running head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docx
toltonkendal
 

Similar to NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data Science.pdf (20)

Heart Disease Prediction Analysis - Sushil Gupta.pptx
Heart Disease Prediction Analysis - Sushil Gupta.pptxHeart Disease Prediction Analysis - Sushil Gupta.pptx
Heart Disease Prediction Analysis - Sushil Gupta.pptx
 
predictionofheartdiseaseusingmachinelearning.pdf
predictionofheartdiseaseusingmachinelearning.pdfpredictionofheartdiseaseusingmachinelearning.pdf
predictionofheartdiseaseusingmachinelearning.pdf
 
Prediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxPrediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptx
 
PPT.pptx
PPT.pptxPPT.pptx
PPT.pptx
 
Genetically Optimized Neural Network for Heart Disease Classification
Genetically Optimized Neural Network for Heart Disease ClassificationGenetically Optimized Neural Network for Heart Disease Classification
Genetically Optimized Neural Network for Heart Disease Classification
 
Heart Attack Prediction System Using Fuzzy C Means Classifier
Heart Attack Prediction System Using Fuzzy C Means ClassifierHeart Attack Prediction System Using Fuzzy C Means Classifier
Heart Attack Prediction System Using Fuzzy C Means Classifier
 
Predicting Heart Disease Using Machine Learning Algorithms.
Predicting Heart Disease Using Machine Learning Algorithms.Predicting Heart Disease Using Machine Learning Algorithms.
Predicting Heart Disease Using Machine Learning Algorithms.
 
Heart disease prediction by using novel optimization algorithm_ A supervised ...
Heart disease prediction by using novel optimization algorithm_ A supervised ...Heart disease prediction by using novel optimization algorithm_ A supervised ...
Heart disease prediction by using novel optimization algorithm_ A supervised ...
 
Running Head SCENARIO NCLEX MEMORIAL HOSPITAL .docx
Running Head SCENARIO NCLEX MEMORIAL HOSPITAL                    .docxRunning Head SCENARIO NCLEX MEMORIAL HOSPITAL                    .docx
Running Head SCENARIO NCLEX MEMORIAL HOSPITAL .docx
 
HEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNING
HEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNINGHEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNING
HEART DISEASE PREDICTION USING MACHINE LEARNING AND DEEP LEARNING
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
APPLYING MACHINE LEARNING TECHNIQUES TO FIND IMPORTANT ATTRIBUTES FOR HEART F...
APPLYING MACHINE LEARNING TECHNIQUES TO FIND IMPORTANT ATTRIBUTES FOR HEART F...APPLYING MACHINE LEARNING TECHNIQUES TO FIND IMPORTANT ATTRIBUTES FOR HEART F...
APPLYING MACHINE LEARNING TECHNIQUES TO FIND IMPORTANT ATTRIBUTES FOR HEART F...
 
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCEPREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
PREVENTION OF HEART PROBLEM USING ARTIFICIAL INTELLIGENCE
 
IRJET- A System to Detect Heart Failure using Deep Learning Techniques
IRJET- A System to Detect Heart Failure using Deep Learning TechniquesIRJET- A System to Detect Heart Failure using Deep Learning Techniques
IRJET- A System to Detect Heart Failure using Deep Learning Techniques
 
Ascendable Clarification for Coronary Illness Prediction using Classification...
Ascendable Clarification for Coronary Illness Prediction using Classification...Ascendable Clarification for Coronary Illness Prediction using Classification...
Ascendable Clarification for Coronary Illness Prediction using Classification...
 
Mining of medical data to identify risk factors of heart disease using freque...
Mining of medical data to identify risk factors of heart disease using freque...Mining of medical data to identify risk factors of heart disease using freque...
Mining of medical data to identify risk factors of heart disease using freque...
 
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
IRJET -Improving the Accuracy of the Heart Disease Prediction using Hybrid Ma...
 
Heart attack possibility.pptx
Heart attack possibility.pptxHeart attack possibility.pptx
Heart attack possibility.pptx
 
Running head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docx
Running head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docxRunning head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docx
Running head PHASE 1 SCENARIO NCLEX MEMOORIAL HOSPITAL1PHASE .docx
 
Biostatistics khushbu
Biostatistics khushbuBiostatistics khushbu
Biostatistics khushbu
 

More from Boston Institute of Analytics

More from Boston Institute of Analytics (20)

Solar production with K means clustering
Solar production with K means clusteringSolar production with K means clustering
Solar production with K means clustering
 
Demystifying Salaries: A Data Science Approach to Predicting Salary Ranges
Demystifying Salaries: A Data Science Approach to Predicting Salary RangesDemystifying Salaries: A Data Science Approach to Predicting Salary Ranges
Demystifying Salaries: A Data Science Approach to Predicting Salary Ranges
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...
Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...
Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
Sensing the Future: Anomaly Detection and Event Prediction in Sensor Networks
Sensing the Future: Anomaly Detection and Event Prediction in Sensor NetworksSensing the Future: Anomaly Detection and Event Prediction in Sensor Networks
Sensing the Future: Anomaly Detection and Event Prediction in Sensor Networks
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
Unveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data ScienceUnveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data Science
 
Beyond Thumbs Up/Down: Using AI to Analyze Movie Reviews
Beyond Thumbs Up/Down: Using AI to Analyze Movie ReviewsBeyond Thumbs Up/Down: Using AI to Analyze Movie Reviews
Beyond Thumbs Up/Down: Using AI to Analyze Movie Reviews
 
Unveiling the Patterns: A Cluster Analysis of NYC Shootings
Unveiling the Patterns: A Cluster Analysis of NYC ShootingsUnveiling the Patterns: A Cluster Analysis of NYC Shootings
Unveiling the Patterns: A Cluster Analysis of NYC Shootings
 
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.orgEnhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.org
 
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRFExploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRF
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Detecting Credit Card Fraud: An AI-driven Approach
Detecting Credit Card Fraud: An AI-driven ApproachDetecting Credit Card Fraud: An AI-driven Approach
Detecting Credit Card Fraud: An AI-driven Approach
 
Predicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning ApproachPredicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning Approach
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
NLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile PricesNLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile Prices
 

Recently uploaded

一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
cyebo
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NLP Data Science Project Presentation: Predicting Heart Disease with NLP Data Science.pdf

  • 2. Introduction Cardiovascular diseases have been the most common cause of death worldwide over the last few decades in developed as well as underdeveloped and developing countries. Early detection of cardiac diseases and continuous supervision by clinicians can reduce the mortality rate. However, it is not possible to monitor every patient accurately every day, and round-the-clock consultation by a doctor is not available, since it requires more knowledge, time, and expertise. Every day, the average human heart beats around 100,000 times, pumping 2,000 gallons of blood through the body. Inside your body, there are 60,000 miles of blood vessels. That is a lot of work for an organ that is about the size of a large fist and weighs between 8 and 12 ounces. The signs of a woman having a heart attack are much less noticeable than the signs of a man. In women, a heart attack may feel like an uncomfortable squeezing, pressure, fullness, or pain in the center of the chest. It may also cause pain in one or both arms, the back, neck, jaw, or stomach, as well as shortness of breath, nausea, and other symptoms. Men experience typical symptoms of heart attack, such as chest pain, discomfort, and stress. They may also experience pain in other areas, such as the arms, neck, back, and jaw, along with shortness of breath, sweating, and discomfort that mimics heartburn.
  • 3. Objective of Data The objective of the UCI Heart Disease dataset is to facilitate research and analysis aimed at developing predictive models for the detection and assessment of heart disease. Specifically, the dataset aims to: • Enable Prediction: Provide a diverse set of medical attributes and corresponding diagnoses to enable the development of machine learning models capable of predicting the likelihood of heart disease in patients. • Support Research: Serve as a valuable resource for researchers and data scientists interested in studying the factors associated with heart disease and exploring novel approaches to its diagnosis and treatment. • Promote Healthcare Innovation: Foster innovation in healthcare by empowering healthcare providers, businesses, and policymakers with data-driven insights into heart disease risk assessment and management. • Improve Patient Outcomes: Ultimately, the primary objective of the dataset is to contribute to the improvement of patient outcomes by facilitating early detection, intervention, and personalized treatment of heart disease.
  • 4. How data can help businesses 1) Healthcare Providers: Hospitals and clinics can use these models to assess the risk of heart disease in patients during routine check-ups. This can lead to early detection and intervention, ultimately improving patient outcomes and reducing healthcare costs. 2) Insurance Companies: Insurance companies can utilize these models to assess the risk of heart disease in their policyholders. By identifying high-risk individuals, they can offer targeted interventions or wellness programs to mitigate the risk and reduce claims. 3) Pharmaceutical Companies: Pharmaceutical companies can use predictive models to identify potential candidates for clinical trials of new drugs aimed at preventing or treating heart disease. This can streamline the drug development process and bring new treatments to market more efficiently. 4) Healthtech Startups: Startups focused on digital health and wellness can develop applications or wearable devices that utilize heart disease prediction models to provide personalized health recommendations to users. This can empower individuals to take proactive steps toward preventing heart disease.
  • 5. Real-life Applications 1) Clinical Decision Support: Healthcare professionals can use these models as decision-support tools during patient consultations. By inputting patient data into the model, clinicians can obtain risk scores and recommendations for further evaluation or treatment. 2) Public Health Initiatives: Public health authorities can utilize predictive models to identify populations at high risk of heart disease and implement targeted prevention strategies, such as educational campaigns, screening programs, or policy interventions. 3) Remote Monitoring: Remote monitoring devices equipped with heart disease prediction algorithms can continuously monitor individuals at risk and alert them or their caregivers of any significant changes or warning signs, enabling timely medical intervention. 4) Personalized Medicine: Predictive models can facilitate the shift towards personalized medicine by enabling healthcare providers to tailor treatment plans based on an individual's risk profile and genetic predisposition to heart disease.
  • 6. About Dataset • This is a multivariate dataset, meaning that each record involves several mathematical or statistical variables and calls for multivariate numerical data analysis. • It is composed of 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, Thalassemia, and the diagnosis target (num). • The full database includes 76 attributes, but all published studies use a subset of 14 of them. One of the major tasks for this dataset is to predict, based on the given attributes of a patient, whether that person has heart disease or not. The other is the exploratory task of finding insights in the dataset that help in understanding the problem better.
  • 7. Column Descriptions 1) id: Unique identifier for each patient 2) age: Age of the patient in years 3) origin: Place of study 4) sex: Gender of the patient 5) cp: Chest pain type (e.g., typical angina, atypical angina, non-anginal, asymptomatic) 6) trestbps: Resting blood pressure (mm Hg on admission) 7) chol: Serum cholesterol level (mg/dl) 8) fbs: Whether fasting blood sugar exceeds 120 mg/dl (True/False) 9) restecg: Resting electrocardiographic results (values: normal, ST-T abnormality, left ventricular hypertrophy) 10) thalach: Maximum heart rate achieved 11) exang: Exercise-induced angina (True/False) 12) oldpeak: ST depression induced by exercise relative to rest 13) slope: Slope of the peak exercise ST segment 14) ca: Number of major vessels colored by fluoroscopy (0-3) 15) thal: Thalassemia diagnosis (normal, fixed defect, reversible defect) 16) num: Predicted attribute indicating presence of heart disease
  • 8. Challenges 1) Data Quality: Ensuring the accuracy and reliability of the medical data is crucial for building effective prediction models. Incomplete or inaccurate data can lead to biased or unreliable predictions. 2) Feature Selection: Identifying the features in the dataset that contribute most to the prediction of heart disease is essential. This requires domain knowledge and careful analysis of the data. 3) Imbalanced Data: Imbalance in the distribution of classes (i.e., presence or absence of heart disease) can affect the performance of machine learning algorithms. Techniques such as oversampling, undersampling, or algorithms that handle imbalanced data well are needed to address this issue. 4) Interpretability: Building models that not only provide accurate predictions but also offer insights into the factors contributing to each prediction is important for gaining the trust of healthcare professionals and patients.
  • 9. Data Understanding Begin by loading the dataset into Python. Verify that the dataset has loaded correctly and examine the first few rows to get a glimpse of the data structure. The dataset has 920 rows and 16 attributes, of which num is the dependent variable that we have to predict.
  • 10. Dataset Overview Based on the summary above, it appears that the data consists of a total of 920 observations. However, many features in this dataset have missing values, including trestbps, chol, fbs, restecg, thalch, exang, oldpeak, slope, ca, and thal. In addition, the dataset contains both numeric and categorical variables.
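The loading and overview steps described above can be sketched in pandas. This is a minimal illustration, not the notebook behind the slides: the five-row frame is a made-up stand-in for the real 920-row file, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the 920-row dataset; the values are invented.
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 62],
    "sex": ["Male", "Male", "Female", "Male", "Female"],
    "trestbps": [145.0, 160.0, np.nan, 120.0, np.nan],
    "chol": [233.0, np.nan, 204.0, 236.0, 268.0],
    "thal": ["fixed defect", None, "normal", None, None],
    "num": [0, 2, 0, 1, 3],
})

print(df.head())   # first few rows
print(df.shape)    # (rows, columns)

# Count and share of missing values per column, largest first.
missing = df.isna().sum().sort_values(ascending=False)
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(df) * 100).round(1),
})
print(summary)

# Separate numeric from categorical columns for later processing.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```

With the real data, the frame would come from a CSV read instead of the inline dictionary; the rest of the checks stay the same.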
  • 11. Exploring Numerical and Categorical Features
  • 12. Exploratory Data Analysis (EDA) Categorical Features – Countplot
  • 18. Outlier Detection Based on the box plot above, trestbps, chol, and thalch exhibit outliers, especially chol. On the contrary, age and exang are two features that do not have outliers.
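A box plot conventionally marks as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly; the function name and the sample values are hypothetical, and the slides do not state which rule their plotting library used.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]               # ignore missing entries
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return arr[(arr < lower) | (arr > upper)]

# Made-up samples: one extreme cholesterol reading, no extreme ages.
chol = [204, 233, 236, 250, 268, 603]
age = [29, 41, 54, 56, 62, 77]

print(iqr_outliers(chol))
print(iqr_outliers(age))
```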
  • 19. Pattern of Missingness • Based on the heatmap above, missing values appear intensively starting from the 300th row. • The top three variables with the highest number of observations with missing values are slope, ca, and thal. • So far, it does not look like the missing values are distributed randomly.
  • 20. Correlation Matrix • From the heatmap above, we observe a strong relationship of missing values between thalch and trestbps, exang and trestbps, oldpeak and trestbps, etc. • Once again, the pattern of missing values among variables does not appear random. • As we mentioned above, the dataset includes 15 variables. However, at least 10 variables have missing values. • Hence, we will apply 2 imputation methods (Median/Mode imputation and Random Forest imputation) to fill in the missing values.
  • 21. Imputing Missing Values Median/Mode Imputation We start with the simplest imputation method, Median/Mode Imputation. If a feature is numerical, its missing values are filled with the median of that feature. For categorical features, the mode replaces the missing values. Numeric variables ==> median value Categorical variables ==> mode value
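The median/mode rule can be written as a short pandas helper. This is a sketch under assumptions: the function name and the toy frame are hypothetical, and the slides do not show the original implementation.

```python
import numpy as np
import pandas as pd

def impute_median_mode(frame):
    """Fill numeric columns with their median and other columns with their mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue                                    # nothing to fill
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Made-up example with one missing value per column.
df = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0, 300.0],
    "slope": ["flat", None, "flat", "upsloping"],
})
filled = impute_median_mode(df)
print(filled)
```

When the imputed data later feeds a model, computing the medians and modes on the training portion only avoids leaking information from the test set.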
  • 23. Distribution of Age Among Patients with and without Heart Disease We can notice that people between the ages of 40 and 70 are the most affected by heart disease
  • 24. Heart Disease Prevalence by Sex We can notice that men are more susceptible to heart disease at all levels.
  • 25. Relationship Between Cholesterol Levels and Heart Disease • The box plot illustrates cholesterol levels across five heart disease categories, showing median values, range variability, and outliers. • Categories 1 to 4 have similar medians, but the spread and outliers differ, with category 0 showing the most variability
  • 26. Maximum Heart Rate and Heart Disease The plot shows a negative correlation where the maximum heart rate tends to decrease as age increases.
  • 27. The Impact of Exercise-Induced Angina on Heart Disease • Most cases in category 0 do not report angina, while categories 1 through 4 show a more varied distribution, with both angina and non-angina cases present. • The data suggests that exercise-induced angina is more commonly reported in individuals with heart disease categories 1 to 4 compared to category 0.
  • 28. Average Resting Blood Pressure by Heart Disease Status • All categories show similar average blood pressures ranging slightly above 120 mm Hg. • The error bars indicate some variability in the measurements, with a slight trend toward increasing variability from status 0 to 4.
  • 29. Distribution of Chest Pain Type among Patients • 'Asymptomatic' is the most common type of chest pain across all heart disease statuses except for status 0, where 'typical angina' is more prevalent. • 'Non-anginal' pain is notably frequent in heart disease status 4, while 'atypical angina' is relatively less common across all statuses.
  • 30. Fasting Blood Sugar and Heart Disease • The majority of individuals across all heart disease statuses have fasting blood sugar levels at or below 120 mg/dl. • For those with higher blood sugar levels, the counts are notably lower, suggesting that elevated fasting blood sugar is less common among these individuals regardless of their heart disease status.
  • 31. Heart Disease Prevalence by Resting Electrocardiographic Results • Most individuals with a normal ECG result fall into the '0' heart disease category, indicating no presence of heart disease. • In contrast, those with ST-T abnormalities show a higher count of heart disease statuses 1 through 4. • Left ventricular hypertrophy is less common but shows some presence across all heart disease categories.
  • 32. Data Preprocessing Some of the features have categorical values, so we one-hot encode them. The original dataset also codes the target as 0, 1, 2, 3, 4. Since we only want to identify the presence of disease, we treat this as binary classification and convert the num column to 0/1: 0 stays 0 (no disease), and values 1 through 4 become 1 (disease present).
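Both preprocessing steps fit in a few lines of pandas. The sketch below is illustrative only: the toy frame and the `target` column name are assumptions, not taken from the slides.

```python
import pandas as pd

# Made-up rows with one categorical feature and the 0-4 target.
df = pd.DataFrame({
    "age": [63, 41, 56, 62],
    "cp": ["typical angina", "asymptomatic", "non-anginal", "asymptomatic"],
    "num": [0, 2, 1, 4],
})

# Binary target: 0 stays 0, any of 1-4 becomes 1.
df["target"] = (df["num"] > 0).astype(int)

# One-hot encode the categorical column; numeric columns pass through unchanged.
features = pd.get_dummies(
    df.drop(columns=["num", "target"]), columns=["cp"], dtype=int
)
print(features)
print(df["target"].tolist())
```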
  • 34. Splitting the Dependent and Independent Features The dependent and independent features are split into training and test sets using train_test_split from the sklearn library. The split is an 80-20 ratio, so the test size is 20% of the data.
  • 35. Feature Scaling • Normalization The Min-Max Normalization method is used to Normalize the data. This method scales the data range to [0,1].
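Min-Max normalization rescales each feature with x' = (x - min) / (max - min). A minimal sketch of an 80-20 split followed by this scaling is shown below; the synthetic data, the random seeds, and the use of `stratify` are assumptions rather than settings taken from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=130, scale=20, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 80-20 split; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Because the minimum and maximum come from the training data, scaled test values can fall slightly outside [0, 1]; that is expected.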
  • 37. In the above figure, the red dots represent the predicted values, which are either 0 or 1, and the blue line and dots represent the actual value for each patient. Where a red dot and a blue dot do not overlap, the prediction is wrong; where the two dots overlap, the prediction is correct.
  • 38. Model Evaluation • The logistic regression model achieved an accuracy of 77.71%. • The confusion matrix shows that the model can classify whether the disease is present or not. However, the numbers of false positives and false negatives are also high; to reduce them, we will fit other classification models.
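The link between the confusion matrix and the reported metrics can be made concrete with a small example. The labels below are invented for illustration and do not reproduce the 77.71% figure.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions (1 = disease present).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / all
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

In a screening setting, false negatives (patients with disease predicted as healthy) are usually the costlier error, so recall deserves attention alongside accuracy.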
  • 39. A ROC curve, or receiver operating characteristic curve, is like a graph that shows how well a classification model performs.
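A ROC curve plots the true positive rate against the false positive rate as the decision threshold varies, and the area under it (AUC) summarizes the curve in one number. The sketch below computes both with scikit-learn on four invented scores; it is not the data from the slides.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of disease.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(list(zip(fpr, tpr)))
print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking of positives above negatives.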
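  • 40. Coefficients Logistic regression combines the features as a weighted sum: each feature is multiplied by its learned coefficient, the products are added to an intercept, and the sigmoid function maps the result to a probability. The sign and size of each coefficient show the direction and strength of that feature's influence on the prediction.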
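For a logistic regression model like the one evaluated above, the weighted sum of the features is passed through the sigmoid function to give a probability. The sketch below shows that calculation by hand; the coefficients, intercept, and feature values are hypothetical and not the fitted values from the slides.

```python
import math

def predict_proba(features, weights, intercept):
    """Probability of the positive class: sigmoid(intercept + sum(w_i * x_i))."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two scaled features.
weights = [0.8, -0.5]
intercept = -0.2
patient = [1.0, 0.4]

# z = -0.2 + 0.8 * 1.0 - 0.5 * 0.4 = 0.4
p = predict_proba(patient, weights, intercept)
print(round(p, 4))
```

A positive coefficient pushes the predicted probability up as the feature grows, and a negative one pushes it down.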
  • 41. Random Forest Classifier Random Forest has given accuracy of 79.34% which is better than Logistic Regression. Also, the precision, recall, and F1 scores improved more than in the previous model.
  • 43. Naïve Bayes Naïve Bayes has given an accuracy of 77.7% which is the same as Logistic Regression. Also, the precision, recall, and F1 scores have improved in this model.
  • 45. Gradient Boosting Classifier Gradient Boosting has performed better than all the models so far, with an accuracy of 80.43%. The model also classifies whether the disease is present or not more accurately.
  • 47. XGBoost Classifier After applying the XGBoost classifier, the true positive and true negative counts in the confusion matrix increased compared with the previous model.
  • 49. LightGBM Here, the accuracy increased to 81.52% and also the false negative and false positive decreased making the model able to classify properly.
  • 51. Hyperparameter Tuning Hyperparameters are external configurations that guide the learning process but are not learned from the data. Hyperparameter tuning is the systematic optimization of these settings to enhance a model's performance. It often employs techniques like grid search, which explores different combinations of hyperparameter values to find the set that maximizes model accuracy or another performance metric. The accuracy of XGBoost did not improve after hyperparameter tuning.
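Grid search can be sketched with scikit-learn's `GridSearchCV`. To stay self-contained, this example tunes scikit-learn's own `GradientBoostingClassifier` on synthetic data as a stand-in for XGBoost or LightGBM; the grid values and the 3-fold setting are assumptions, not the settings used in the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the heart dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: every combination (2 x 2 x 2 = 8) is cross-validated.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The best cross-validated score is not guaranteed to beat the default settings on a held-out test set, which is consistent with tuning sometimes yielding no improvement.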
  • 52. The accuracy of LightGBM also didn’t improve.
  • 53. Model Selection • The accuracy of both XGBoost and LightGBM did not increase after hyperparameter tuning. However, LightGBM has the highest accuracy, about 82%, and classified both classes correctly in most cases. Therefore, LightGBM is the best model for the heart disease prediction data. • As per the result, the model has a precision score of around 82%, which is quite acceptable for predicting heart disease in an individual based on the characteristics age, sex, cp, trestbps, chol, fbs, restecg, thalch, exang, oldpeak, slope, ca, and thal.
  • 54. Summary 1) The patients' ages range from 29 to 77 years, with an average age of 54. 2) The majority of the patients are male (75.9%), and the most common type of chest pain experienced by the patients is typical angina (39.6%). 3) The average resting blood pressure is 131.6 mmHg and the average cholesterol level is 246 mg/dL. 4) The average maximum heart rate achieved during exercise is 139.9 bpm. 5) Most patients (70.3%) do not experience exercise-induced angina. 6) The average ST depression induced by exercise is 1.04 mm, and the majority of the patients (54.8%) have a normal ECG result. 7) Several classification models were trained and evaluated, including Logistic Regression, Random Forest, Naive Bayes, Gradient Boosting, XGBoost, and LightGBM. 8) The LightGBM model achieved the highest accuracy of 80.97% after hyperparameter tuning. 9) The ROC curves and AUC scores for each model were analyzed to assess their performance. 10) The results suggest that the XGBoost and LightGBM models are suitable for predicting the presence or absence of heart disease based on the available features.