• To investigate the performance of the applied models in
terms of accuracy in order to select the best one.
The rest of the document is outlined as follows: Section
II explores the literature review of related works we have
studied to develop our idea. Section III illustrates details of
our methodology. Section IV analyzes the performance of the
applied algorithms on the chosen dataset. Finally, Section V
concludes the paper with a summary.
II. LITERATURE REVIEW
In recent times, researchers have proposed different machine
learning based techniques to detect the existence of heart
disease among patients.
In [6], classification algorithms such as Naive Bayes, SVM,
and KNN were applied to clinical data to predict whether or
not a patient has cardiopathy. With an accuracy of 86.6%,
Naive Bayes predicted heart disease better than the other
algorithms.
In [7], dimensionality reduction was performed using two
methods including feature extraction and feature selection.
Among several supervised machine learning algorithms, SVM
performed very well in this study.
In [8], the primary goal is to make heart disease prediction
more dynamic by using various sensors, such as AliveKor,
HealthGear, MyHeart, and Fitbit, to gather data on heart
disease and thereby avoid costly medical examinations. For
training and testing, neural network and multi-layer
perceptron techniques were implemented.
In [9], heart disease prediction models were generated using
seven classification techniques. RapidMiner Studio, a data
science software platform, was used to perform the
experiments. The prediction model built with a voting
classifier on nine selected features obtained the highest
accuracy of 87.41%. A benchmarking tool was used to assess
the performance of the applied model relative to other
works.
In [10], the authors used a multilayer perceptron neural
network trained with backpropagation. The experimental
findings demonstrate that the proposed neural-network-based
system can accurately identify heart disease.
In [11], several classification algorithms including SVM,
Logistic Regression, Decision Tree, KNN, Naive Bayes, and
ANN were implemented. For feature selection, algorithms such
as Least Absolute Shrinkage and Selection Operator (LASSO),
Minimum Redundancy Maximum Relevance, and Relief were
applied. The dataset underwent various statistical
operations before the models were trained.
In [12], authors have used several machine learning meth-
ods to compare the accuracy of the heart disease diagnosis.
Without any feature selection constraints, the Hybrid Random
Forest with Linear Model (HRFLM) technique predicts CVDs
with lower classification error and higher accuracy.
In [13], the authors analyzed the accuracy of each algorithm
with the support of a confusion matrix while developing a
model for heart disease prediction. In this work, KNN
outperformed the other classifiers with 87% accuracy.
In [14], Relief was identified as the best feature selection
algorithm. Chest pain, exercise-induced angina, and thallium
scan are reported as the most important features. Logistic
Regression outperformed the other classifiers in terms of
accuracy, whereas SVM was the best in terms of specificity.
The study also focused on reducing execution time.
In [15], between Naive Bayes and Decision Tree, Decision
Tree has done significantly well with 19 attributes. Each
attribute’s information gain has been calculated and the highest
value of information gain is taken to construct a shorter tree.
In [16], the research demonstrates that it is critical to select
the most appropriate and influential features to maximize
heart disease prediction performance. Although only six of
the eight features were selected, the accuracy varied only
slightly. In this work, Random Forest yields the highest
accuracy of 95%.
To the best of the authors' knowledge, only a few works exist
on the combined dataset used in this work. Apart from
traditional machine learning algorithms (SVM, Naive Bayes,
Logistic Regression, KNN), this work includes an ensemble
voting classifier, a more recent approach that incorporates
multiple diverse models.
III. METHODOLOGY
Before implementing the machine learning algorithms and
analyzing their results, we defined a set of procedural steps
as the methodology for achieving our objectives. Fig. 1
represents the overall workflow of our study and sums up
every step required to reach the goal. Initially, a dataset
based on ECG reports is read from a CSV file. The parameters
of the dataset are then studied and preprocessed before the
algorithms are applied to predict a result.
The following sub-sections go into greater detail on our
workflow.
A. Data Collection
To accomplish our goal, we started by collecting data from
the UCI machine learning repository, whose datasets are well
vetted by the research community. The dataset we use is a
combination of five popular independent datasets available
in that repository. It is essentially an ECG dataset, combined
over the 12 attributes common to the five constituent
datasets, resulting in 1190 records in total, and is claimed
to be the largest CVD dataset [17] available to research
practitioners. The dataset contains common medical
parameters related to heart condition along with information
on comorbidities. The details of the five constituent
datasets are given in Table I.
Fig. 1. Overall workflow.
TABLE I. Dataset Overview
Name of Dataset | Number of Data | Source
Cleveland Dataset | 303 | Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
Hungarian Dataset | 294 | Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
Switzerland Dataset | 123 | University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
Long Beach VA Dataset | 200 | V.A. Medical Center, Long Beach
Statlog (Heart) Dataset | 270 | University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
Total Data | 1190 |
B. Feature Description
The dataset holds 1190 records of patients from four
different countries (UK, US, Hungary and Switzerland). It
consists of 11 features and 1 target variable, as shown in
Table II.
TABLE II. Attributes of the Dataset
SL No. | Attribute Name | Attribute Description
1 | Age | Patient's age in years
2 | Sex | 1 = male; 0 = female
3 | Chest pain | Chest pain type, ranging from 0 to 3 depending on the symptoms experienced by the patient
4 | Resting BPS | Resting blood pressure (in mm Hg after being admitted to the hospital)
5 | Cholesterol | Serum cholesterol in mg/dl
6 | Fasting blood sugar | Fasting blood sugar > 120 mg/dl (1 = blood sugar level in excess of 120 mg/dl; 0 = blood sugar level lower than 120 mg/dl)
7 | Resting ECG | Resting electrocardiographic results
8 | Max heart rate | Maximum heart rate of an individual using a Thallium Test, measured in beats per minute
9 | Exercise angina | Exercise-triggered angina (1 = yes; 0 = no)
10 | Oldpeak | ST depression induced by exercise relative to rest
11 | ST slope | Slope of the peak exercise ST segment
12 | Target | 1 or 0 (1 = heart attack may happen; 0 = heart attack may not happen)
C. Preprocessing of Data
Because we work with a large amount of diverse data, we had
to preprocess the dataset. Preprocessing improves data
quality, which facilitates practical insights and leads to
better results from the system. It converts unprocessed data
into a comprehensible and readable format. For preprocessing,
we conducted the following steps:
1) Identifying and handling missing values: If missing values
are not found and resolved appropriately, the conclusions
drawn may be inaccurate. When there are enough samples in
the dataset, a row that holds null values can be removed to
avoid introducing bias. Alternatively, a missing value can be
replaced with the mean, median or mode of the attribute,
which is applicable to numerical data. We used the isnull()
function of the Python library pandas to detect missing
values and found no null values. This indicates the
completeness of the dataset.
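The missing-value check and the two handling strategies described in step 1 can be sketched as follows. This is a minimal illustration rather than the script used in this study: the small DataFrame and its column names are made-up stand-ins for the real CSV, and the median imputation only demonstrates the alternative strategy, since no null values were found in the actual dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV (column names are assumptions).
df = pd.DataFrame({
    "age": [63, 45, None, 58],
    "cholesterol": [233.0, None, 250.0, 180.0],
    "target": [1, 0, 1, 0],
})

# Count missing entries per column, as reported by isnull().
missing_per_column = df.isnull().sum()

# Strategy 1: drop every row that contains at least one null value.
dropped = df.dropna()

# Strategy 2: replace nulls in numeric columns with the column median.
imputed = df.fillna(df.median(numeric_only=True))

print(missing_per_column)
```

On a complete dataset, every entry of `missing_per_column` is zero and both strategies leave the data unchanged.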
2) Data balancing: A dataset is said to be balanced if it
contains approximately as many positive samples as negative
ones. Some machine learning classifiers struggle with
imbalanced training datasets because they are sensitive to
the proportions of the different classes [18]. Fig. 2 shows
the target classes, where “1” represents patients having
heart disease and “0” represents patients not having heart
disease. The number of patients with heart disease is 629,
whereas 561 patients have no heart disease. Thus, the target
classes contain nearly equal numbers of entries, which
indicates a balanced dataset.
Fig. 2. Distribution of target class.
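The balance check can be reproduced from the two class counts given in the text. The sketch below rebuilds a label list from those counts (629 positive, 561 negative) and reports the class shares; the variable names and the majority-to-minority ratio are illustrative choices, not part of the original analysis.

```python
from collections import Counter

# Class counts reported in the text: 629 positive ("1") and 561 negative ("0").
labels = [1] * 629 + [0] * 561

counts = Counter(labels)
total = sum(counts.values())

# Share of each class and the majority-to-minority ratio.
shares = {cls: n / total for cls, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(round(imbalance_ratio, 2))
```

A ratio close to 1 corresponds to the nearly equal class sizes visible in Fig. 2.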
Fig. 3. Age variation for every target class.
D. Data Analysis
Fig. 3 shows the variation of age for each target class. It
is visible that people aged around 50-65 are more prone to
heart disease than people in other age groups. The graph
thus depicts the tendency toward heart disease in a certain
age group.
Fig. 4. Correlation between different features.
Fig. 4 illustrates a correlation heatmap. Correlation
describes how strongly the input features are linearly
related to one another and to the target variable. Here, the
strength of the correlation ranges from -1 to +1. Values
closer to zero mean a weaker linear relationship between the
two variables. Values close to +1 indicate that the variables
are strongly positively correlated, whereas values close to
-1 indicate a strong negative correlation. In the heatmap,
the most positive correlations take the darkest shade of
green, while the most negative correlations take the darkest
tone of red. From this heatmap, we can ascertain that ST
slope is the feature most positively correlated with the
target (0.51), whereas max heart rate has the most negative
correlation with the target (-0.41).
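Each cell of such a heatmap is a correlation coefficient between two columns. The text does not name the coefficient, so the sketch below assumes the Pearson coefficient, the usual default for a linear correlation heatmap. The sample values are made up purely to show a positive and a negative case and are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only; not taken from the dataset.
target = [0, 0, 1, 1, 1, 0]
st_slope = [1, 1, 2, 2, 3, 1]
max_heart_rate = [172, 165, 140, 150, 120, 168]

r_pos = pearson(st_slope, target)        # positive: rises with the target
r_neg = pearson(max_heart_rate, target)  # negative: falls as the target rises

print(round(r_pos, 2), round(r_neg, 2))
```

The sign of each coefficient gives the direction of the relationship, and its magnitude gives the strength, which is what the green and red shading encodes.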
E. Training and Testing
Five machine learning classification methods, namely KNN,
Naive Bayes, SVM, Logistic Regression, and an Ensemble voting
classifier, are used to develop prediction models on the
dataset. The Ensemble voting classifier is a hybrid
classifier implemented with Logistic Regression, Random
Forest and Naive Bayes. Before implementing the algorithms,
the dataset is split into two portions, a train set and a
test set. The train set, for which the corresponding outputs
are already known, is used to fit the machine learning model,
whereas the test set is used to evaluate the model's
predictions. We split our dataset with a train-test ratio of
67:33. The training set takes 797 records of the total data,
leaving the remaining 393 records for testing. The accuracy
is calculated by comparing the actual response values with
the predicted response values. After building the prediction
model, we can make predictions on out-of-sample data to make
sure that the model is ready for heart disease prediction.
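The split sizes and the idea behind the voting classifier can be sketched with the standard library alone. Several details here are assumptions, since the text does not specify them: the test share is rounded up (which reproduces the reported 797/393 split), the shuffle seed is arbitrary, and the ensemble is shown as hard (majority) voting, although soft voting over predicted probabilities is an equally plausible reading. The function names are hypothetical.

```python
import math
import random
from collections import Counter

def train_test_split_indices(n_samples, test_ratio=0.33, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_ratio)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

def majority_vote(predictions):
    """Hard voting: return the label predicted by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

train_idx, test_idx = train_test_split_indices(1190)
print(len(train_idx), len(test_idx))

# Hypothetical outputs of the three base models for one patient.
print(majority_vote([1, 0, 1]))
```

With three base models and binary labels, a majority always exists, so no tie-breaking rule is needed.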
IV. RESULT ANALYSIS
Python is used to implement the classification algorithms, as
it offers versatile and rich libraries. After going through
the aforementioned steps, the machine learning classifiers
are trained on the chosen dataset.
TABLE III. Comparison Among Algorithms
Algorithms | Training Accuracy | Testing Accuracy | Precision | Recall | F-measure
SVM | 82.56% | 85.49% | 87.05% | 87.44% | 87.24%
Naive Bayes | 83.18% | 84.98% | 87.27% | 86.09% | 86.67%
Logistic Regression | 81.05% | 84.47% | 86.49% | 86.09% | 86.29%
Ensemble Vote Classifier | 84.06% | 84.11% | 89.14% | 88.34% | 88.64%
KNN | 75.28% | 75.31% | 80.58% | 74.44% | 77.39%
The NumPy Python library is used to perform mathematical
operations on the confusion matrix to yield the accuracy. The
confusion matrix facilitates the evaluation of the model for
performance analysis. It is a table layout of the different
outcomes of the prediction that helps to visualize the
results. It contains the following four outcomes:
• True Positive (TP): The number of accurately identified
actual positive values
• True Negative (TN): The number of accurately identified
actual negative values
• False Positive (FP): The number of times the negative
values are predicted as positive
• False Negative (FN) : The number of times the positive
values are predicted as negative
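The four outcomes above can be counted directly from paired actual and predicted labels. The following is a minimal sketch, assuming binary labels with 1 as the positive class; the function name and the label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Made-up labels for illustration only.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)
```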
The confusion matrix of SVM algorithm is given in Table IV.
TABLE IV. Confusion Matrix of SVM Classifier
Total Tests = 393 | Predicted No | Predicted Yes
Actual No | 141 | 29
Actual Yes | 28 | 195
The performance of the model cannot be judged clearly just
by observing the confusion matrix. To determine how accurate
the model is, accuracy, precision, recall, and F-measure are
calculated using (1), (2), (3), and (4), respectively.
Accuracy is the proportion of correctly classified values: it
tells how often the classifier predicts correctly, and is the
total of all true predictions divided by the total number of
predictions.
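$$\text{Accuracy} = \frac{\text{True Predictions}}{\text{Total Predictions}} \tag{1}$$
$$\text{Precision} = \frac{\text{True Positives}}{\text{False Positives} + \text{True Positives}} \tag{2}$$
$$\text{Recall} = \frac{\text{True Positives}}{\text{False Negatives} + \text{True Positives}} \tag{3}$$
$$\text{F-Measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{4}$$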
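As a worked check of (1) to (4), the sketch below plugs in the SVM counts from Table IV, reading the table as TN = 141, FP = 29, FN = 28 and TP = 195. Plain Python arithmetic is used here for brevity; the variable names are illustrative.

```python
# Counts from Table IV (SVM, 393 test records).
tn, fp = 141, 29
fn, tp = 28, 195

total = tp + tn + fp + fn
accuracy = (tp + tn) / total                                # equation (1)
precision = tp / (fp + tp)                                  # equation (2)
recall = tp / (fn + tp)                                     # equation (3)
f_measure = 2 * recall * precision / (recall + precision)   # equation (4)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f-measure", f_measure)]:
    print(f"{name}: {value:.2%}")
```

The computed values agree with the SVM row of Table III to within 0.01 percentage point; the small differences in accuracy and F-measure are consistent with the table values having been truncated rather than rounded.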
Table III shows the comparison of the implemented algorithms
based on the aforementioned performance metrics. From the
table, SVM has outperformed the other classification
algorithms in testing accuracy (85.49%). SVM aims to find the
best hyperplane (also called the decision boundary) in a
binary classification model, splitting the dataset into two
classes to predict whether or not a patient has CVD. On the
other hand, KNN has shown the lowest accuracy (75.31%) among
these algorithms; KNN tends to work well with a small number
of input variables but struggles when the number of inputs is
very large.
Fig. 5 gives a more in-depth analysis of the models by
plotting the total number of wrongly predicted values for
each classification algorithm. The total number of wrong
predictions on the test set is calculated from the confusion
matrix as (FP + FN). SVM generates the fewest wrong
predictions, whereas KNN produces the most, as it has the
lowest accuracy. From this investigation, it can be inferred
that SVM performs more competently than the other
algorithms.
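For SVM, the quantity plotted in Fig. 5 follows directly from Table IV. The short sketch below computes it; only the SVM counts are used because the confusion matrices of the other models are not given in the text.

```python
# SVM counts from Table IV; the other models' matrices are not given in the text.
fp, fn = 29, 28
total_tests = 393

wrong = fp + fn                    # total wrong predictions
error_rate = wrong / total_tests   # complement of the testing accuracy

print(wrong)                 # 57
print(f"{error_rate:.2%}")   # 14.50%
```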
[Fig. 5: bar chart. x-axis: Prediction Models (SVM, Naive Bayes, Logistic Regression, Ensemble Vote Classifier, KNN); y-axis: Total Wrong Predicted Values (scale 0 to 100).]
Fig. 5. Wrong predicted values for the applied algorithms.
In Table V, the performance of our work and other existing
works is compared based on the accuracy obtained by the best
performing algorithm in our work, i.e., SVM. We attained
better accuracy in this case using the combined dataset (1190
records), whereas the other works applied the algorithm to
the Cleveland dataset (303 records).
TABLE V. Comparison with Other Existing Work
Work Reference | Accuracy Using SVM
Ours | 85.49%
[6] | 77.7%
[11] | 83%
[13] | 85%
V. CONCLUSION
The number of deaths due to heart disease is increasing day
by day owing to the lack of early prognosis and timely
treatment. In the case of heart disease, early diagnosis can
improve the chance of survival and reduce the associated
health complications. In this work, we have developed a
prediction model to detect heart disease based on parameters
derived from ECG reports. To achieve this goal, the five
aforementioned classification algorithms have been applied to
a publicly available dataset with 1190 records, which is a
combination of five popular datasets in this field. Among the
applied classifiers, the Ensemble Voting classifier is a
hybrid technique combining Logistic Regression, Random Forest
and Naive Bayes. From our analysis, SVM performs reasonably
well, with 85.49% accuracy. As a further enhancement of this
study, a large-scale dataset can be collected from local
hospitals, as individual health also depends on regional and
socio-economic factors. In future, more analysis can be
performed with different combinations of algorithms in hybrid
techniques to obtain a better performing heart disease
prediction model.
REFERENCES
[1] T. WHO. (2016) Cardiovascular diseases. [Online]. Available:
https://www.who.int/health-topics/cardiovascular-diseases
[2] R. Cementa. (2018) Heart attacks: Men vs. women. [Online]. Available:
https://www.caringseniorservice.com/blog/heart-attacks-men-vs.-women?
[3] M. Diwakar, A. Tripathi, K. Joshi, M. Memoria, P. Singh et al., “Latest
trends on heart disease prediction using machine learning and image
fusion,” Materials Today: Proceedings, vol. 37, pp. 3213–3218, 2021.
[4] F. Ali, S. El-Sappagh, S. R. Islam, D. Kwak, A. Ali, M. Imran,
and K.-S. Kwak, “A smart healthcare monitoring system for heart
disease prediction based on ensemble deep learning and feature fusion,”
Information Fusion, vol. 63, pp. 208–222, 2020.
[5] J. Soni, U. Ansari, D. Sharma, and S. Soni, “Intelligent and effective
heart disease prediction system using weighted associative classifiers,”
International Journal on Computer Science and Engineering, vol. 3,
no. 6, pp. 2385–2392, 2011.
[6] S. Anitha and N. Sridevi, “Heart disease prediction using data mining
techniques,” Journal of Analysis and Computation, 2019.
[7] V. Ramalingam, A. Dandapath, and M. K. Raja, “Heart disease predic-
tion using machine learning techniques: a survey,” International Journal
of Engineering & Technology, vol. 7, no. 2.8, pp. 684–687, 2018.
[8] A. Gavhane, G. Kokkula, I. Pandya, and K. Devadkar, “Prediction of
heart disease using machine learning,” in 2018 Second International
Conference on Electronics, Communication and Aerospace Technology
(ICECA). IEEE, 2018, pp. 1275–1278.
[9] M. S. Amin, Y. K. Chiam, and K. D. Varathan, “Identification of
significant features and data mining techniques in predicting heart
disease,” Telematics and Informatics, vol. 36, pp. 82–93, 2019.
[10] P. Singh, S. Singh, and G. S. Pandi-Jain, “Effective heart disease
prediction system using data mining techniques,” International journal
of nanomedicine, vol. 13, no. T-NANO 2014 Abstracts, p. 121, 2018.
[11] J. P. Li, A. U. Haq, S. U. Din, J. Khan, A. Khan, and A. Saboor, “Heart
disease identification method using machine learning classification in
e-healthcare,” IEEE Access, vol. 8, pp. 107 562–107 582, 2020.
[12] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease
prediction using hybrid machine learning techniques,” IEEE Access,
vol. 7, pp. 81 542–81 554, 2019.
[13] A. Singh and R. Kumar, “Heart disease prediction using machine
learning algorithms,” in 2020 international conference on electrical and
electronics engineering (ICE3). IEEE, 2020, pp. 452–457.
[14] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, “A hybrid
intelligent system framework for the prediction of heart disease using
machine learning algorithms,” Mobile Information Systems, vol. 2018,
2018.
[15] S. Nikhar and A. Karandikar, “Prediction of heart disease using machine
learning algorithms,” International Journal of Advanced Engineering,
Management and Science, vol. 2, no. 6, p. 239484, 2016.
[16] N. S. C. Reddy, S. S. Nee, L. Z. Min, and C. X. Ying, “Classification
and feature selection approaches by machine learning techniques: Heart
disease prediction,” International Journal of Innovative Computing,
vol. 9, no. 1, 2019.
[17] M. Siddharta. (2019) Heart disease dataset (most comprehensive).
[Online]. Available: https://www.kaggle.com/sid321axn/heart-statlog-cleveland-hungary-final
[18] D. J. Dittman, T. M. Khoshgoftaar, and A. Napolitano, “The effect of
data sampling when using random forest on imbalanced bioinformatics
data,” in 2015 IEEE international conference on information reuse and
integration. IEEE, 2015, pp. 457–463.