238_heartdisease (1).pdf

2021 International Conference on Automation, Control and Mechatronics for Industry 4.0 (ACMI), 8-9 July 2021, Rajshahi, Bangladesh
Applying Machine Learning Classifiers on ECG
Dataset for Predicting Heart Disease
Adiba Ibnat Hossain∗, Sabitri Sikder†, Annesha Das‡ and Ashim Dey§
Department of Computer Science and Engineering
Chittagong University of Engineering and Technology
Chittagong-4349, Bangladesh
∗hossainadiba123@gmail.com, †sabitri287525@gmail.com, ‡annesha@cuet.ac.bd, §ashim@cuet.ac.bd
Abstract—Sudden death from heart disease is rising at an
alarming rate, and the disease has become a common cause of
death worldwide. Encouragingly, heart disease is often
preventable through simple lifestyle changes, and early
detection can greatly improve the chance of recovery. Identifying
high-risk patients is difficult because of the multifaceted nature
of risk factors such as high cholesterol, high blood
pressure, and diabetes. In most cases, the diagnosis of heart
disease depends on the doctor's observation and expertise rather
than on the large amount of knowledge-rich medical data available.
To change this situation, scientists and doctors have turned to
machine learning techniques that evaluate screening results along
with other medical parameters to predict heart disease. This
study implements five machine learning algorithms, namely
Support Vector Machine, Logistic Regression, K-nearest Neighbor,
Naive Bayes, and an Ensemble Voting Classifier, on a dataset of
1190 records collected from the UCI repository. The dataset
combines five independent ECG datasets, which provides more
records than any single constituent dataset.
The relations among the attributes of the dataset are analyzed
before the accuracy is calculated. Among the five classification
algorithms, Support Vector Machine outperforms the other
classifiers with an accuracy of 85.49%. We hope this study will
support early diagnosis of heart disease and increase the chance
of survival.
Keywords—Cardiovascular disease, ECG dataset, Heart disease
prediction, Machine learning classifiers, Support vector machine
I. INTRODUCTION
The term “Heart Disease”, also known as “Cardiovascular
Disease (CVD)”, commonly refers to conditions that affect the
muscles, valves, and blood vessels of the heart and that can
give rise to severe cardiovascular problems leading to a heart
attack. Angina (chest pain or discomfort) is regarded as a
form of CVD in which constricted or blocked blood vessels can
endanger the life of a patient by causing heart failure.
CVDs are among the leading causes of death all over the world.
As per the World Health Survey carried out by the World Health
Organization (WHO), CVDs are estimated to account for 17.9
million deaths every year, which is about 31% of all deaths
globally [1]. Death rates from heart disease are highest in
developed countries and regions such as the USA, Scotland, and
Northern England. According to statistics from the American
Heart Association in 2018, one in three deaths in the USA is
caused by heart disease [2]. Heart disease can often be avoided
by changing daily habits, such as maintaining a healthy diet,
quitting alcohol and tobacco, and exercising regularly. Early
diagnosis of heart disease can make the difference between life
and death because patients can be treated before they become
seriously ill. Therefore, the prediction of heart disease is
regarded as one of the main focuses of medical data research.
A huge amount of raw medical data is available to be processed
into actionable knowledge for cardiovascular data analysis,
which can support prompt predictions and decisions based on
credible facts.
Usually, the tests a patient needs for the diagnosis of heart
disease depend on which condition the physician suspects.
Besides blood tests, chest X-ray, and electrocardiogram
(ECG), conventional diagnostic tests include cardiac
magnetic resonance imaging (MRI), cardiac computerized
tomography (CT) scan, echocardiogram, Holter monitoring,
heart catheterization, and stress test. Moreover, a number of
new techniques and models based on machine learning and
image processing have been introduced, such as medical
image fusion [3], a feature fusion approach [4], and a
prediction model based on the Weighted Associative Classifier
(WAC) [5]. In many developing countries, the scarcity of medical
professionals and the lack of efficient diagnostic tools make
it increasingly difficult to diagnose heart disease and to
provide proper treatment.
This study aims to address these difficulties by developing a
prediction model based on machine learning algorithms that
takes medical parameters of a patient and analyzes them to
predict whether the patient may have heart disease. For this
purpose, we have used Support Vector Machine (SVM), Logistic
Regression, K-nearest neighbors (KNN), Naïve Bayes, and an
Ensemble Voting Classifier to implement a heart disease
prediction model using a publicly available ECG dataset. The
main goals of our work are:
• To analyze the comprehensive dataset consisting of the
Statlog-Heart, Long Beach VA, Switzerland, Hungarian,
and Cleveland datasets by depicting the relations and
implications between the features.
• To develop five classification models on the chosen
dataset based on the aforementioned algorithms.
• To investigate the performance of the applied models
in terms of their accuracy in order to select the best one.
978-1-6654-3843-8/21/$31.00 ©2021 IEEE
The rest of the paper is organized as follows: Section
II reviews the related works we studied to develop our idea.
Section III details our methodology. Section IV analyzes the
performance of the applied algorithms on the chosen dataset.
Finally, Section V concludes the paper with a summary.
II. LITERATURE REVIEW
In recent times, researchers have proposed different machine
learning based techniques to detect the existence of heart
disease among patients.
In [6], classification algorithms such as Naive Bayes, SVM,
and KNN were applied to clinical data to predict whether or
not a patient has cardiopathy. With an accuracy of 86.6%,
Naive Bayes predicted heart disease better than the other
algorithms.
In [7], dimensionality reduction was performed using two
methods including feature extraction and feature selection.
Among several supervised machine learning algorithms, SVM
performed very well in this study.
In [8], the primary goal was to make heart disease prediction
more dynamic by using various sensors, such as AliveKor,
HealthGear, MyHeart, and Fitbit, to gather data on heart
disease and thereby avoid costly medical examinations. For
training and testing, a neural network algorithm and the
multi-layer perceptron technique were implemented.
In [9], heart disease prediction models were generated using
seven classification techniques. RapidMiner Studio, a data
science software platform, was used to perform the
experiments. The prediction model built using a voting
classifier with nine selected features obtained the highest
accuracy of 87.41%. A benchmarking tool was used to assess the
performance of the applied model relative to other works.
In [10], multilayer perceptron neural network with back-
propagation has been used by the authors as the training
algorithm. The findings of the experiments demonstrate that
the proposed system based on neural networks can accurately
identify heart disease.
In [11], several classification algorithms including SVM,
Logistic Regression, Decision Tree, KNN, Naive Bayes, and
ANN were implemented. For the selection of appropriate
features, algorithms such as Least Absolute Shrinkage and
Selection Operator (LASSO), Minimum Redundancy Maximum
Relevance, and Relief were applied. The dataset went through
various statistical operations before the models were trained.
In [12], authors have used several machine learning meth-
ods to compare the accuracy of the heart disease diagnosis.
Without any feature selection constraints, the Hybrid Random
Forest with Linear Model (HRFLM) technique predicts CVDs
with lower classification error and higher accuracy.
In [13], the authors analyzed the accuracy of each algorithm
with the support of a confusion matrix while developing a model
for heart disease prediction. In this work, KNN performed
better than the other classifiers, with 87% accuracy.
In [14], Relief was identified as the best feature selection
algorithm. Chest pain, exercise-induced angina, and thallium
scan were reported as the most important features. Logistic
Regression outperformed the other classifiers in terms of
accuracy, whereas SVM was the best in terms of specificity.
The study also focused on reducing the execution time.
In [15], between Naive Bayes and Decision Tree, Decision
Tree has done significantly well with 19 attributes. Each
attribute’s information gain has been calculated and the highest
value of information gain is taken to construct a shorter tree.
In [16], the research demonstrates that it is critical to select
the most appropriate and influential features to maximize
the heart disease prediction result. When six of the eight
features were selected, the accuracy varied only a little.
In this work, Random Forest yielded the highest accuracy of
95%.
To the best of the authors' knowledge, only a few works exist
on the combined dataset used in this work. Apart from the
traditional machine learning algorithms (SVM, Naïve Bayes,
Logistic Regression, KNN), this work includes an ensemble
voting classifier, a more recent technique that incorporates
multiple diverse models.
III. METHODOLOGY
Before implementing the machine learning algorithms and
analyzing their results, we defined a set of procedural steps
and established a methodology to achieve our objectives.
Fig. 1 presents the overall workflow of our study, which sums
up every step required to reach the goal. Initially, a dataset
based on ECG reports is read from a CSV file. The parameters of
the dataset are then studied and preprocessed before the
algorithms are applied to predict a result.
The following sub-sections go into greater detail on our
workflow.
A. Data Collection
To accomplish our goal, we started by collecting data from
UCI repository datasets, which are well verified by the
research community. The dataset we collected is a combination
of five popular independent datasets available in the UCI
machine learning repository. It is essentially an ECG dataset,
combined over 12 common attributes from the five constituent
datasets, resulting in 1190 records in total, and it is claimed
to be the largest CVD dataset [17] available to research
practitioners. The dataset contains common medical parameters
related to heart condition along with information on
comorbidities. The details of the five constituent datasets
are shown in Table I.

Fig. 1. Overall workflow.
TABLE I. Dataset Overview
Name of Dataset          Number of Data  Source
Cleveland Dataset        303             Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
Hungarian Dataset        294             Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
Switzerland Dataset      123             University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
Long Beach VA Dataset    200             V.A. Medical Center, Long Beach
Statlog (Heart) Dataset  270             University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
Total Data               1190
B. Feature Description
The dataset holds 1190 records of patients from four differ-
ent countries (UK, US, Hungary and Switzerland). It consists
of 11 features and 1 target variable as exhibited in Table II.
C. Preprocessing of Data
Because the combined dataset brings together diverse data from
several sources, we had to preprocess it. Preprocessing
improves data quality, which facilitates practical insights and
better results from the system. It converts raw data into a
comprehensible and consistent format. For preprocessing, we
conducted the following steps:
1) Identifying and handling missing values: If missing values
are not found and resolved appropriately, the conclusions drawn
may be inaccurate. When there are enough samples in the
dataset, a row that holds null values can be removed to avoid
introducing bias. Alternatively, the missing value can be
replaced with the mean, median, or mode of the attribute, which
is applicable to numerical data. We used the isnull() function
of the Python library pandas to detect missing values and found
zero null values. This indicates the completeness of the
dataset.
TABLE II. Attributes of the Dataset
SL No.  Attribute Name       Attribute Description
1       Age                  Patient's age in years
2       Sex                  1 = male; 0 = female
3       Chest pain           Chest pain type; ranges from 0-3 depending upon the symptoms experienced by a patient
4       Resting BPS          Resting blood pressure (in mm Hg after being admitted to the hospital)
5       Cholesterol          Serum cholesterol in mg/dl
6       Fasting blood sugar  Fasting blood sugar > 120 mg/dl (1 = blood sugar level in excess of 120 mg/dl; 0 = blood sugar level lower than 120 mg/dl)
7       Resting ECG          Resting electrocardiographic results
8       Max heart rate       The maximum heart rate of an individual using a Thallium Test, measured in beats per minute
9       Exercise angina      Exercise triggered angina (1 = yes; 0 = no)
10      Oldpeak              ST depression induced by exercise relative to rest
11      ST slope             The slope of the peak exercise ST segment
12      Target               1 or 0 (1 = heart attack may happen; 0 = heart attack may not happen)
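The missing-value check described in step 1 can be sketched as follows. This is a minimal illustration, assuming pandas is available; the toy rows and the column names are hypothetical and are not taken from the dataset, which itself contained no null values.

```python
import pandas as pd

# Hypothetical toy rows; column names are illustrative, not the dataset's headers.
records = pd.DataFrame(
    {
        "age": [52, 61, None, 45],
        "cholesterol": [212.0, None, 250.0, 198.0],
        "target": [0, 1, 1, 0],
    }
)

# Count missing values per column with isnull().
missing_per_column = records.isnull().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = records.dropna()

# Option 2: replace missing numeric values with the column median.
imputed = records.fillna(records.median(numeric_only=True))

print(missing_per_column.to_dict())
print(len(dropped), int(imputed.isnull().sum().sum()))
```

On a dataset without null values, as reported here, every entry of missing_per_column would be zero and neither option would change the data.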
2) Data balancing: A dataset is said to be balanced if it
contains approximately the same number of positive and negative
samples. Some machine learning classifiers struggle with
imbalanced training datasets because they are sensitive to the
proportions of the different classes [18]. Fig. 2 shows the
target classes, where “1” represents patients having heart
disease and “0” represents patients not having heart disease.
The number of patients with heart disease is 629, whereas 561
patients have no heart disease. Thus, the target classes
contain a nearly equal number of entries, which indicates a
balanced dataset.
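The balance check in step 2 amounts to simple arithmetic on the two class counts. The sketch below uses only the counts reported above (629 and 561); the minority-to-majority ratio is an illustrative measure we introduce here, not one used in the original analysis.

```python
from collections import Counter

# Class counts reported for the target: 1 = heart disease, 0 = no heart disease.
labels = [1] * 629 + [0] * 561

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, and the minority-to-majority ratio (1.0 = perfectly balanced).
shares = {label: count / total for label, count in counts.items()}
balance_ratio = min(counts.values()) / max(counts.values())

print(f"positive share: {shares[1]:.2%}, negative share: {shares[0]:.2%}")
print(f"minority/majority ratio: {balance_ratio:.2f}")
```

With these counts, the positive class makes up about 52.86% of the records and the negative class about 47.14%.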
Fig. 2. Distribution of target class.

Fig. 3. Age variation for every target class.
D. Data Analysis
Fig. 3 shows the variation of age for each target class.
People aged around 50-65 are more prone to heart disease than
people in other age groups. The graph thus depicts the
probable tendency of having heart disease in a certain age
group.
Fig. 4. Correlation between different features.
Fig. 4 illustrates a correlation heatmap. Correlation measures
how strongly two features, or a feature and the target
variable, are linearly related. The correlation coefficient
ranges from -1 to +1. Values closer to zero mean that there is
a weaker linear relationship between the two variables.
Values close to +1 indicate a strong positive correlation,
whereas values close to -1 indicate a strong negative
correlation. In the heatmap, the most positive correlations
take the darkest shade of green, while the most negative
correlations take the darkest shade of red. From this
correlation heatmap, we can see that ST slope is the feature
most positively correlated with the target (0.51), whereas max
heart rate has the most negative correlation with the target
(-0.41).
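Each cell of a correlation heatmap is typically a Pearson correlation coefficient. The sketch below computes it from its definition, under the assumption that Pearson correlation is the measure used in Fig. 4, which the text does not state explicitly. The toy values are hypothetical and only illustrate the sign of the relationship; they do not reproduce the reported 0.51 and -0.41.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values, not taken from the dataset.
st_slope = [1, 2, 2, 3, 1, 2]
max_heart_rate = [170, 140, 150, 120, 165, 145]
target = [0, 1, 1, 1, 0, 1]

r_slope = pearson(st_slope, target)      # positive in this toy example
r_rate = pearson(max_heart_rate, target)  # negative in this toy example
print(f"ST slope vs target: {r_slope:+.2f}")
print(f"max heart rate vs target: {r_rate:+.2f}")
```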
E. Training and Testing
Five machine learning classification methods, namely KNN,
Naive Bayes, SVM, Logistic Regression, and an Ensemble Voting
Classifier, are used to develop prediction models on the
dataset. The Ensemble Voting Classifier is a hybrid classifier
built from Logistic Regression, Random Forest, and Naive
Bayes. Before implementing the algorithms, the dataset is split
into two portions: a train set and a test set. The train set,
for which the corresponding output is already known, is used to
fit the machine learning model, whereas the test set is used to
evaluate the predictions of the trained model. We split our
dataset with a train-test ratio of 67:33, so the training set
takes 797 records of the total data, leaving the remaining 393
records for testing. The accuracy is calculated by comparing
the actual response values with the predicted response values.
After building the prediction model, we can make predictions on
out-of-sample data to check that the model is ready for heart
disease prediction.
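The split sizes and the idea of voting can be illustrated with a small standard-library sketch. Several details are assumptions and not taken from the text: the test share is rounded up to a whole record, the shuffle uses a fixed seed, and the ensemble uses hard (majority) voting. The helper names split_indices and majority_vote are hypothetical.

```python
import math
import random
from collections import Counter


def split_indices(n_records, test_share, seed=0):
    """Shuffle record indices and split them into train and test index lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_share * n_records)  # assumption: test share rounded up
    return indices[n_test:], indices[:n_test]


def majority_vote(predictions):
    """Hard voting: return the most common label among the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


train_idx, test_idx = split_indices(1190, 0.33)
print(len(train_idx), len(test_idx))

# Hypothetical votes from three base models for one patient.
print(majority_vote([1, 0, 1]))
```

With 1190 records and a 33% test share, this rounding gives 393 test records and 797 training records, matching the sizes reported above.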
IV. RESULT ANALYSIS
TABLE III. Comparison Among Algorithms
Algorithms                Training Accuracy  Testing Accuracy  Precision  Recall  F-measure
SVM                       82.56%             85.49%            87.05%     87.44%  87.24%
Naive Bayes               83.18%             84.98%            87.27%     86.09%  86.67%
Logistic Regression       81.05%             84.47%            86.49%     86.09%  86.29%
Ensemble Vote Classifier  84.06%             84.11%            89.14%     88.34%  88.64%
KNN                       75.28%             75.31%            80.58%     74.44%  77.39%
Python is used to implement the classification algorithms, as
it offers versatile and rich libraries. After the
aforementioned steps, the machine learning classifiers are
trained on the chosen dataset. The NumPy Python library is used
to perform mathematical operations on the confusion matrix to
yield the accuracy. The confusion matrix facilitates the
evaluation of the model for performance analysis. It is a table
layout of the different outcomes of the prediction that helps
to visualize the results. It contains four outcomes, as
follows:
• True Positive (TP): The number of accurately identified
actual positive values
• True Negative (TN): The number of accurately identified
actual negative values
• False Positive (FP): The number of times the negative
values are predicted as positive
• False Negative (FN) : The number of times the positive
values are predicted as negative
The confusion matrix of SVM algorithm is given in Table IV.
TABLE IV. Confusion Matrix of SVM Classifier
Total Tests = 393  Predicted No  Predicted Yes
Actual No          141           29
Actual Yes         28            195
The performance of the model cannot be judged clearly from the
confusion matrix alone. To determine how well the model
performs, accuracy, precision, recall, and F-measure are
calculated using (1), (2), (3), and (4), respectively. Accuracy
is the proportion of correctly classified values; it tells how
often the classifier predicts correctly. It is the number of
all true predictions divided by the total number of
predictions.
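\[ \text{Accuracy} = \frac{\text{True Predictions}}{\text{Total Predictions}} \tag{1} \]
\[ \text{Precision} = \frac{\text{True Positives}}{\text{False Positives} + \text{True Positives}} \tag{2} \]
\[ \text{Recall} = \frac{\text{True Positives}}{\text{False Negatives} + \text{True Positives}} \tag{3} \]
\[ \text{F-Measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{4} \]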
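As a worked example, (1) to (4) can be applied to the SVM confusion matrix in Table IV. The sketch below assumes NumPy is available and that rows are actual classes and columns are predicted classes, as laid out in Table IV; it illustrates the arithmetic and is not the original implementation.

```python
import numpy as np

# Confusion matrix of the SVM classifier (Table IV).
# Rows: actual No, actual Yes. Columns: predicted No, predicted Yes.
cm = np.array([[141, 29],
               [28, 195]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()                            # equation (1)
precision = tp / (fp + tp)                                 # equation (2)
recall = tp / (fn + tp)                                    # equation (3)
f_measure = 2 * recall * precision / (recall + precision)  # equation (4)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f-measure", f_measure)]:
    print(f"{name}: {value:.4f}")
```

The resulting values agree with the SVM row of Table III up to rounding in the last reported digit.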
Table III compares the implemented algorithms based on the
aforementioned performance metrics. From the table, SVM
outperforms the other classification algorithms with a testing
accuracy of 85.49%. SVM aims to find the best hyperplane (also
called the decision boundary) in a binary classification model,
and here it best splits our dataset into two classes, yielding
the prediction of whether a patient would have CVD or not. On
the other hand, KNN shows the lowest accuracy (75.31%) among
these algorithms; KNN works well with a small number of input
variables but struggles when the number of inputs is very large.
Fig. 5 gives a further analysis of the models by plotting the
total number of wrongly predicted values for each
classification algorithm. The total number of wrong predictions
on the test set is calculated from the confusion matrix as
(FP + FN). SVM generates the fewest wrong predictions. In
contrast, KNN produces the highest number of wrong predictions,
as it has the lowest accuracy. From this investigation, it can
be inferred that SVM performs better than the other algorithms
on this dataset.
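For SVM, the number of wrong predictions follows directly from the FP and FN cells of Table IV. The short sketch below shows the calculation and its relation to accuracy; the helper name wrong_predictions is hypothetical, and counts for the other classifiers are not derived here because their confusion matrices are not listed.

```python
def wrong_predictions(fp, fn):
    """Total number of misclassified test records: false positives plus false negatives."""
    return fp + fn


total_tests = 393                             # size of the test set
svm_wrong = wrong_predictions(fp=29, fn=28)   # FP and FN cells of Table IV
svm_error_rate = svm_wrong / total_tests      # equals 1 - accuracy

print(svm_wrong, f"{svm_error_rate:.4f}")
```

Here 57 of the 393 test records are misclassified, which leaves 336 correct predictions.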
[Figure: total wrong predicted values (y-axis, 0 to 100) for each prediction model (x-axis: SVM, Naive Bayes, Logistic Regression, Ensemble Vote Classifier, KNN).]
Fig. 5. Wrong predicted values for the applied algorithms.
In Table V, the performance of our work and other existing
works is compared based on the accuracy obtained by the
best performing algorithm in our work, i.e., SVM. We attained
better accuracy in this case using the combined dataset (1190
records), whereas the other works applied the algorithm to the
Cleveland dataset (303 records).
TABLE V. Comparison with Other Existing Work
Work Reference  Accuracy Using SVM
Ours            85.49%
[6]             77.7%
[11]            83%
[13]            85%
V. CONCLUSION
The number of deaths due to heart disease is increasing day
by day owing to the lack of early prognosis and timely
treatment. In the case of heart disease, early diagnosis can
improve the chance of survival and reduce the associated health
complications. In this work, we have developed a prediction
model to detect heart disease based on parameters derived from
ECG reports. To achieve this goal, the five aforementioned
classification algorithms have been applied to a publicly
available dataset of 1190 records, which is a combination of
five popular datasets in this field. Among the applied
classifiers, the Ensemble Voting Classifier is a hybrid
technique combining Logistic Regression, Random Forest, and
Naive Bayes. From our analysis, SVM performs reasonably well,
with 85.49% accuracy. As a further enhancement of this study, a
large-scale dataset can be collected from local hospitals,
since individual health also depends on regional and
socio-economic factors. In the future, more analysis can be
performed with different combinations of algorithms in hybrid
techniques to obtain a better performing heart disease
prediction model.
REFERENCES
[1] T. WHO. (2016) Cardiovascular diseases. [Online]. Available:
https://www.who.int/health-topics/cardiovascular-diseases
[2] R. Cementa. (2018) Heart attacks: Men vs. women. [Online]. Available:
https://www.caringseniorservice.com/blog/heart-attacks-men-vs.-women?
[3] M. Diwakar, A. Tripathi, K. Joshi, M. Memoria, P. Singh et al., “Latest
trends on heart disease prediction using machine learning and image
fusion,” Materials Today: Proceedings, vol. 37, pp. 3213–3218, 2021.
[4] F. Ali, S. El-Sappagh, S. R. Islam, D. Kwak, A. Ali, M. Imran,
and K.-S. Kwak, “A smart healthcare monitoring system for heart
disease prediction based on ensemble deep learning and feature fusion,”
Information Fusion, vol. 63, pp. 208–222, 2020.
[5] J. Soni, U. Ansari, D. Sharma, and S. Soni, “Intelligent and effective
heart disease prediction system using weighted associative classifiers,”
International Journal on Computer Science and Engineering, vol. 3,
no. 6, pp. 2385–2392, 2011.
[6] S. Anitha and N. Sridevi, “Heart disease prediction using data mining
techniques,” Journal of Analysis and Computation, 2019.
[7] V. Ramalingam, A. Dandapath, and M. K. Raja, “Heart disease predic-
tion using machine learning techniques: a survey,” International Journal
of Engineering & Technology, vol. 7, no. 2.8, pp. 684–687, 2018.
[8] A. Gavhane, G. Kokkula, I. Pandya, and K. Devadkar, “Prediction of
heart disease using machine learning,” in 2018 Second International
Conference on Electronics, Communication and Aerospace Technology
(ICECA). IEEE, 2018, pp. 1275–1278.
[9] M. S. Amin, Y. K. Chiam, and K. D. Varathan, “Identification of
significant features and data mining techniques in predicting heart
disease,” Telematics and Informatics, vol. 36, pp. 82–93, 2019.
[10] P. Singh, S. Singh, and G. S. Pandi-Jain, “Effective heart disease
prediction system using data mining techniques,” International journal
of nanomedicine, vol. 13, no. T-NANO 2014 Abstracts, p. 121, 2018.
[11] J. P. Li, A. U. Haq, S. U. Din, J. Khan, A. Khan, and A. Saboor, “Heart
disease identification method using machine learning classification in
e-healthcare,” IEEE Access, vol. 8, pp. 107 562–107 582, 2020.
[12] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease
prediction using hybrid machine learning techniques,” IEEE Access,
vol. 7, pp. 81 542–81 554, 2019.
[13] A. Singh and R. Kumar, “Heart disease prediction using machine
learning algorithms,” in 2020 international conference on electrical and
electronics engineering (ICE3). IEEE, 2020, pp. 452–457.
[14] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, “A hybrid
intelligent system framework for the prediction of heart disease using
machine learning algorithms,” Mobile Information Systems, vol. 2018,
2018.
[15] S. Nikhar and A. Karandikar, “Prediction of heart disease using machine
learning algorithms,” International Journal of Advanced Engineering,
Management and Science, vol. 2, no. 6, p. 239484, 2016.
[16] N. S. C. Reddy, S. S. Nee, L. Z. Min, and C. X. Ying, “Classification
and feature selection approaches by machine learning techniques: Heart
disease prediction,” International Journal of Innovative Computing,
vol. 9, no. 1, 2019.
[17] M. Siddharta. (2019) Heart disease dataset (most comprehensive).
[Online]. Available:
https://www.kaggle.com/sid321axn/heart-statlog-cleveland-hungary-final
[18] D. J. Dittman, T. M. Khoshgoftaar, and A. Napolitano, “The effect of
data sampling when using random forest on imbalanced bioinformatics
data,” in 2015 IEEE international conference on information reuse and
integration. IEEE, 2015, pp. 457–463.
