A Survey on Stroke Prediction
Drubojit Saha 170104027
Mohammad Rakib-Uz-Zaman 170104041
Tasnim Nusrat Hasan 170104046
Project Report
Course ID: CSE 4214
Course Name: Pattern Recognition Lab
Semester: Fall 2020
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
30 September 2021
A Survey on Stroke Prediction
Submitted by
Drubojit Saha 170104027
Mohammad Rakib-Uz-Zaman 170104041
Tasnim Nusrat Hasan 170104046
Submitted To
Faisal Muhammad Shah, Associate Professor
Farzad Ahmed, Lecturer
Md. Tanvir Rouf Shawon, Lecturer
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
30 September 2021
ABSTRACT
Stroke incidence has risen rapidly around the globe, not only at older ages but also at juvenile ages. Stroke prediction is a complex task that requires an enormous amount of data pre-processing, and there is a need to automate the prediction process so that symptoms related to stroke can be detected early and the disease averted at an early stage. In this research, stroke prediction is studied on two datasets collected from Kaggle (a benchmark and a non-benchmark dataset) using the proposed models. The models estimate the chance that a person will have a stroke based on attributes such as age, gender, average glucose level, smoking status, body mass index, work type and residence type. Each model assesses the person's risk level by applying various machine learning algorithms: Gaussian Naive Bayes, K-Nearest Neighbor (KNN), Decision Tree and Support Vector Machine (SVM). Finally, a comparative study of the different algorithms is presented and the most competent one is identified.
Chapter 1
Introduction
A stroke is a brain attack that cuts off crucial blood flow and oxygen to the brain. A stroke arises when a blood vessel feeding the brain gets clogged or bursts. Identifying a stroke and taking medical action immediately can not only lengthen life but also help prevent heart disease in the future.
Machine learning has become one of the most demanding fields in modern technology. Stroke prediction in adults can be performed using various machine learning algorithms. It has become a captivating research problem, as there are many parameters that can affect the outcome. These include work type, glucose level, body mass index, gender, residence type, age, smoking status of the individual and any previous heart disease.
The proposed models predict the stroke risk of individuals using machine learning algorithms such as K-Nearest Neighbors, Decision Tree, Support Vector Machine, and Gaussian Naive Bayes, based on the features mentioned above, which have been taken from the datasets on which the models have been trained.
Chapter 2
Literature Reviews
In [1], various classification algorithms are studied and the most accurate model for predicting stroke in patients is identified. Decision Tree and SVM were found to be the most efficient algorithms, while KNN was found to be the least effective one.
In [2], different oversampling techniques for imbalanced datasets are discussed. That survey also identifies some of the major flaws in stroke-related studies, so that suitable solutions can be proposed to overcome the disease.
Chapter 3
Data Collection & Processing
3.1 Dataset
The datasets have been taken from the Kaggle website (both a benchmark and a non-benchmark dataset). Fig 3.1 shows the benchmark dataset, which has 5110 rows and 11 columns. The features consist of gender, age, work type, residence type, average glucose level, body mass index (BMI), hypertension, heart disease, ever married and smoking status. The target column is stroke. On the other hand, Fig 3.2 shows the non-benchmark dataset, which has 31652 rows and 11 columns; its attributes are the same as those of the benchmark dataset, and the target column is again stroke. For both datasets, all features except the id have been used for training the models.
Figure 3.1: Benchmark Dataset
Figure 3.2: Non-benchmark Dataset
3.2 Dataset Pre-processing
The benchmark dataset contains 201 null values in the BMI attribute, which are replaced by the mean BMI value. A label encoding technique is used to convert the labels of the gender, ever married, work type, residence type and smoking status features into numeric values so that the machine can read them in a machine-readable form. The average glucose level and BMI attributes are normalized, which rescales these numeric columns to a common scale between 0 and 1. Both datasets are then split into an 80:20 train-test ratio.
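As an illustration, these pre-processing steps can be sketched with pandas on a few toy rows; the column names follow the Kaggle stroke dataset, but this is a minimal sketch, not the project's exact code.

```python
import pandas as pd

# Toy stand-in for the Kaggle stroke dataset (column names assumed).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "No"],
    "avg_glucose_level": [228.69, 105.92, 171.23, 89.50],
    "bmi": [36.6, None, 32.5, 27.1],
    "stroke": [1, 0, 1, 0],
})

# 1. Replace null BMI values with the mean BMI.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# 2. Label-encode the categorical features into numeric codes.
for col in ["gender", "ever_married"]:
    df[col] = df[col].astype("category").cat.codes

# 3. Min-max normalize the numeric features to the [0, 1] range.
for col in ["avg_glucose_level", "bmi"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# 4. An 80:20 train-test split.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
```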
3.3 Feature Engineering in Train Data
Feature engineering refers to the use of domain knowledge to select and transform the most suitable features from raw data when building a predictive model with machine learning.
3.3.1 Handling Imbalance Data
3.3.1.1 Borderline SMOTE+Random Undersampling
The Borderline SMOTE algorithm is used for oversampling the minority class, combined with the Random Undersampling algorithm, as a hybrid technique for balancing the imbalanced data. Borderline SMOTE starts by classifying the minority class observations, using the KNN algorithm for this classification. It classifies a minority observation as a noise point if all of its neighbors belong to the majority class, and such an observation is ignored while creating synthetic data. It classifies as border points those observations that have both majority and minority class points in their neighborhood, and it resamples exclusively from these points. Random undersampling works by randomly selecting observations from the majority class and deleting them from the training dataset.
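A minimal sketch of this hybrid scheme on toy data may clarify the idea; the cluster positions, the neighborhood threshold for "border" points, and the undersampling size are illustrative assumptions, not the project's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 40 majority (class 0) and 8 minority (class 1) points.
X_maj = rng.normal(0.0, 1.0, size=(40, 2))
X_min = rng.normal(1.5, 0.5, size=(8, 2))

def knn_majority_fraction(p, X_maj, X_min, k=5):
    """Fraction of p's k nearest neighbors belonging to the majority class."""
    pts = np.vstack([X_maj, X_min])
    labels = np.array([0] * len(X_maj) + [1] * len(X_min))
    d = np.linalg.norm(pts - p, axis=1)
    nearest = np.argsort(d)[1:k + 1]  # skip the point itself
    return np.mean(labels[nearest] == 0)

# Border points: minority points with a mixed neighborhood (not pure noise,
# not safely inside the minority region).
border = [p for p in X_min
          if 0.5 <= knn_majority_fraction(p, X_maj, X_min) < 1.0]

# Oversample: interpolate between each border point and a random minority point.
synthetic = [p + rng.random() * (X_min[rng.integers(len(X_min))] - p)
             for p in border]

# Random undersampling: keep a random subset of the majority class.
keep = rng.choice(len(X_maj), size=20, replace=False)
X_maj_under = X_maj[keep]
```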
3.3.1.2 SVMSMOTE+Random Undersampling
The SVMSMOTE algorithm is used for oversampling the minority class, combined with the Random Undersampling algorithm, as a hybrid technique for balancing the imbalanced data. SVMSMOTE is an alternative to Borderline SMOTE: it uses an SVM instead of KNN to classify the minority class observations as described in the previous section. Apart from this difference, the algorithm works the same as Borderline SMOTE. Random undersampling again works by randomly selecting observations from the majority class and deleting them from the training dataset.
3.3.1.3 SMOTE+Tomek
SMOTE+TOMEK is a hybrid technique that aims to clean overlapping data points between the classes in the sample space. Tomek links are applied to the minority class samples oversampled by SMOTE. As a result, rather than discarding observations only from the majority class, it eliminates the observations of both classes that form Tomek links.
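A Tomek link is a pair of points of opposite classes that are each other's nearest neighbors; a small sketch on hand-picked 1-D data (values chosen so that exactly one link exists):

```python
import numpy as np

# Toy 1-D data: the class-0 point at 2.0 and the class-1 point at 2.1
# are mutual nearest neighbors, forming a Tomek link.
X = np.array([[0.0], [0.5], [2.0], [2.1], [4.0], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

def nearest(i):
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the point itself
    return int(np.argmin(d))

# Tomek links: opposite-class pairs that are mutual nearest neighbors.
links = [(i, j) for i in range(len(X))
         for j in [nearest(i)]
         if y[i] != y[j] and nearest(j) == i and i < j]

# Remove both endpoints of every link.
drop = {k for pair in links for k in pair}
X_clean = X[[i for i in range(len(X)) if i not in drop]]
y_clean = y[[i for i in range(len(X)) if i not in drop]]
```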
3.3.1.4 SMOTE+ENN
SMOTE+ENN is another hybrid technique, in which a larger number of observations is removed from the sample space. ENN (Edited Nearest Neighbours) is an undersampling technique in which the nearest neighbors of each majority class observation are examined; if the neighbors misclassify that specific instance of the majority class, the instance is eliminated.
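The ENN editing step can be sketched as follows on toy 1-D data; the data and k are illustrative, and this shows only the editing of misclassified majority points, not the preceding SMOTE step.

```python
import numpy as np

# Toy 1-D data: majority class 0, minority class 1. The class-0 point at 3.0
# sits inside the minority region, so its neighbors misclassify it.
X = np.array([[0.0], [0.2], [0.4], [3.0], [3.1], [3.2], [3.3]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

def knn_vote(i, k=3):
    """Majority label among the k nearest neighbors of point i."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the point itself
    nearest = np.argsort(d)[:k]
    return int(round(np.mean(y[nearest])))

# ENN: drop majority-class points misclassified by their neighbors.
keep = [i for i in range(len(X)) if y[i] == 1 or knn_vote(i) == 0]
X_clean, y_clean = X[keep], y[keep]
```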
3.3.2 Feature Engineering
Feature engineering refers to the procedure of using domain knowledge to choose and transform the most relevant features when building a predictive model with a machine learning model. Its aim is to improve the performance of machine learning (ML) algorithms.
3.3.2.1 Pearson Correlation
Pearson's correlation coefficient measures the strength of the linear association between two features.
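The coefficient is the covariance of the two features divided by the product of their standard deviations; a small numeric check on toy data:

```python
import numpy as np

# Two toy features: the second is a noisy linear function of the first.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson's r: covariance divided by the product of standard deviations.
r = np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))

# np.corrcoef computes the same coefficient.
r_np = np.corrcoef(x, y)[0, 1]
```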
3.3.2.2 Univariate Process
Univariate feature selection works by choosing the best features based on univariate statis-
tical tests. It can be seen as a preprocessing step for an estimator.
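A short sketch with scikit-learn's SelectKBest; the Iris dataset and k=2 are illustrative stand-ins, since the report's stroke data is not reproduced here.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Keep the 2 features with the highest ANOVA F-scores against the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
```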
3.3.2.3 Extra Tree Classifier
Extremely Randomized Trees Classifier is an ensemble learning technique that aggregates the results of multiple de-correlated decision trees, collected in a "forest", to produce its classification result. The Extra Tree Classifier is similar to the Random Forest Classifier and differs from it only in the way the decision trees in the forest are constructed.
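A minimal scikit-learn sketch; the Iris dataset is an illustrative stand-in, and the `feature_importances_` attribute is what makes the classifier usable for feature ranking.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_iris(return_X_y=True)

# A forest of extremely randomized trees; the aggregated impurity-based
# importances can be used to rank features for selection.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
importances = clf.feature_importances_
```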
3.4 Classification
3.4.1 K-Nearest Neighbor
K-Nearest Neighbor is an elementary algorithm that stores all the available cases and classifies new data based on a similarity measure. 'K' is the number of nearest neighbors that vote on the class of the new data point. To find the k least distant points, distance metrics such as Euclidean distance and Manhattan distance are used. It is also called a Lazy Learner because it does not learn a discriminative function from the training data: it simply stores the training data, and there is no explicit learning phase in the model.
3.4.2 Decision Tree
It is a tree-shaped diagram used to evaluate a course of action. Each branch of the tree represents a possible decision. It is used for both classification and regression: classification is applied to discrete values, while regression is applied to continuous values. A classification tree applies a set of logical if-then conditions to classify problems, while a regression tree is employed when the target value is numerical or continuous in nature.
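A minimal scikit-learn sketch of a classification tree; the Iris dataset and split are illustrative stand-ins.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The tree learns a hierarchy of if-then splits on feature thresholds.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
acc = tree.score(X_test, y_test)
```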
3.4.3 Support Vector Machine (SVM)
Support Vector Machine is a supervised learning method in which the model learns from past data and produces future predictions as output. It works on labeled sample data to find the decision boundary that separates the classes; new unlabelled data is then plotted against this boundary, and its value is predicted.
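The report later compares linear, polynomial and RBF kernels; a minimal scikit-learn sketch of the three, on the illustrative Iris dataset rather than the stroke data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit one SVM per kernel and record its test accuracy.
scores = {}
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, random_state=0)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
```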
3.4.4 Gaussian Naive Bayes
Gaussian Naive Bayes is a variant of Naive Bayes that assumes a Gaussian (normal) distribution and handles continuous data. When working with continuous data, the assumption is made that the continuous values associated with each class are distributed according to a Gaussian distribution. It supports continuous-valued features and models each of them as following a Gaussian distribution.
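A minimal scikit-learn sketch; again the Iris dataset is only an illustrative stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each feature is modeled as a per-class Gaussian; Bayes' rule combines them.
gnb = GaussianNB()
gnb.fit(X_train, y_train)
acc = gnb.score(X_test, y_test)
```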
Chapter 4
Methodology
In this paper, different models are introduced to predict whether an individual will have a
stroke or not based on several features like age, gender, hypertension, heart disease, ever
married, smoking status, work type, etc. The benchmark and the non-benchmark dataset are
trained on various machine learning algorithms and their performance is inspected to find
out which one would be the best to efficiently predict stroke. Fig 4.1 shows the flowchart of the proposed model. First, data collection is done, followed by data pre-processing steps to obtain a cleaned dataset without null or duplicate values, for better training and higher accuracy. After that, each dataset is split into training and testing data and fed into the different classification models. The confusion matrix and performance metrics of each model are obtained to identify the most efficient algorithm for the prediction.
Chapter 5
Experiments and Results
The results obtained after applying Decision Tree, KNN, Gaussian Naive Bayes and SVM to both datasets are shown in this section. The metrics used for the performance analysis of the algorithms are accuracy, precision (P), recall (R) and F1-score. These metrics are derived from the confusion matrix, which is used to assess the overall performance of each model.
After that, Table 5.1, Table 5.2, Table 5.3 and Table 5.4 show the recall values (for both the benchmark and non-benchmark datasets) obtained from each of the machine learning models under the four hybrid data balancing techniques, both with and without the feature engineering techniques. It can be observed from the tables that the highest recall value is obtained from K-Nearest Neighbor and the worst recall value from Decision Tree.
Table 5.1 shows the recall values (for the benchmark dataset) obtained from each of the machine learning models under the four hybrid data balancing techniques without feature engineering, and Table 5.2 shows the corresponding values with feature engineering techniques. Similarly, Table 5.3 shows the recall values (for the non-benchmark dataset) without feature engineering, and Table 5.4 shows them with feature engineering techniques.
Figure 5.3: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Benchmark Dataset without Feature Engineering)
Figure 5.4: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Benchmark Dataset without Feature Engineering)
Figure 5.5: Classification Report and Confusion Matrix of KNN (K=2) using Borderline SMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering)
Figure 5.6: Classification Report and Confusion Matrix of SVM RBF (K=5) using SVMSMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering)
Figure 5.7: Classification Report and Confusion Matrix of SVM Linear using SMOTE+TOMEK (Benchmark Dataset with Feature Engineering)
Figure 5.8: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Benchmark Dataset with Feature Engineering)
Figure 5.9: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1) using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset without Feature Engineering)
Figure 5.10: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1) using SVMSMOTE+Random Undersampling (Non-benchmark Dataset without Feature Engineering)
Figure 5.11: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Non-benchmark Dataset without Feature Engineering)
Figure 5.12: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Non-benchmark Dataset without Feature Engineering)
Figure 5.13: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=3) using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset with Feature Engineering)
Figure 5.14: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1) using SVMSMOTE+Random Undersampling (Non-benchmark Dataset with Feature Engineering)
Figure 5.15: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Non-benchmark Dataset with Feature Engineering)
Figure 5.16: Classification Report and Confusion Matrix of SVM Linear using SMOTE+ENN (Non-benchmark Dataset with Feature Engineering)
Chapter 6
Future Work and Conclusion
In this paper, Decision Tree gave the worst results in all cases. Gaussian Naive Bayes gave a good recall value for label 1 (stroke) in all cases, but most of the time it gave a lower recall value for label 0 (no stroke) than for label 1 (stroke), and also lower accuracy.
In general, KNN and SVM gave good results compared to all other models, as shown in Table 6.1. SVM with kernel=Poly performed particularly well. For the non-benchmark dataset, which is a large dataset, the hybrid techniques SMOTE+TOMEK and SMOTE+ENN performed better than Borderline SMOTE and SVMSMOTE. Gaussian Naive Bayes did give a recall value for label 1 (stroke) under Borderline SMOTE and SVMSMOTE good enough to compete with SMOTE+TOMEK and SMOTE+ENN, but, as mentioned earlier, it usually did not give a good recall value for label 0 (no stroke) and also gave low accuracy. So, overall, SMOTE+TOMEK and SMOTE+ENN performed better for the non-benchmark dataset.
In this paper, a survey of different machine learning algorithms is presented. In the near future, this work will be extended by applying deep learning algorithms as well.
Table 6.1: Results (best model per balancing technique and dataset configuration)

Balancing technique                    | Benchmark, no FE           | Benchmark, FE  | Non-benchmark, no FE       | Non-benchmark, FE
Borderline SMOTE+Random Undersampling | KNN (K=5)                  | KNN (K=2)      | Gaussian Naive Bayes (K=1) | Gaussian Naive Bayes (K=3)
SVMSMOTE+Random Undersampling         | Gaussian Naive Bayes (K=8) | SVM RBF (K=5)  | Gaussian Naive Bayes (K=1) | Gaussian Naive Bayes (K=1)
SMOTE+TOMEK                           | SVM Poly                   | SVM Linear     | SVM Poly                   | SVM Poly
SMOTE+ENN                             | SVM Poly                   | SVM Poly       | SVM Poly                   | SVM Linear
References
[1] T. Rakshit and A. Shrestha, "Comparative analysis and implementation of heart stroke prediction using various machine learning techniques."
[2] H. Shashank, S. Srikanth, A. Thejas, et al., Prediction of Stroke Using Machine Learning. PhD thesis, CMR Institute of Technology, Bangalore, 2020.
Generated using Undergraduate Thesis LaTeX Template, Version 1.4. Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh. This project report was generated on Wednesday 29th September, 2021 at 4:11pm.