A Survey on Stroke Prediction
Drubojit Saha 170104027
Mohammad Rakib-Uz-Zaman 170104041
Tasnim Nusrat Hasan 170104046
Project Report
Course ID: CSE 4214
Course Name: Pattern Recognition Lab
Semester: Fall 2020
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
30 September 2021
A Survey on Stroke Prediction
Submitted by
Drubojit Saha 170104027
Mohammad Rakib-Uz-Zaman 170104041
Tasnim Nusrat Hasan 170104046
Submitted To
Faisal Muhammad Shah, Associate Professor
Farzad Ahmed, Lecturer
Md. Tanvir Rouf Shawon, Lecturer
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Department of Computer Science and Engineering
Ahsanullah University of Science and Technology
Dhaka, Bangladesh
30 September 2021
ABSTRACT
Stroke incidence has risen rapidly worldwide, not only at older ages but also at juvenile
ages. Stroke prediction is a complex task that requires an enormous amount of data
pre-processing, and there is a need to automate the prediction process for the early
detection of stroke-related symptoms so that the disease can be averted at an early
stage. In this research, stroke prediction is studied on two datasets collected from
Kaggle (a benchmark and a non-benchmark dataset) using the proposed models. Different
models estimate the chance that a person will have a stroke based on attributes like age,
gender, average glucose level, smoking status, body mass index, work type and residence
type. Each model analyzes the person’s risk level using various machine learning
algorithms like Gaussian Naive Bayes, K-Nearest Neighbor (KNN), Decision Tree and
Support Vector Machine (SVM). Finally, a comparative study of the different algorithms
is presented and the most competent one is identified.
Contents
ABSTRACT i
List of Figures iv
List of Tables vi
1 Introduction 1
2 Literature Reviews 2
3 Data Collection & Processing 3
3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2 Dataset Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3 Feature Engineering in Train Data . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3.1 Handling Imbalance Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3.1.1 Borderline SMOTE+Random Undersampling . . . . . . . . . . 4
3.3.1.2 SVMSMOTE+Random Undersampling . . . . . . . . . . . . . . 5
3.3.1.3 SMOTE+Tomek . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3.1.4 SMOTE+ENN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3.2 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3.2.1 Pearson Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3.2.2 Univariate Process . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3.2.3 Extra Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4.1 K-Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4.2 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4.3 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . . . 7
3.4.4 Gaussian Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4 Methodology 8
5 Experiments and Results 10
6 Future Work and Conclusion 15
References 16
List of Figures
3.1 Benchmark Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2 Non-benchmark Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1 Flow Chart of Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.1 Classification Report and Confusion Matrix of KNN (K=5) using Borderline
SMOTE+Random Undersampling (Benchmark Dataset without Feature Engineering) . . . 11
5.2 Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=8) us-
ing SVMSMOTE+Random Undersampling (Benchmark Dataset without Fea-
ture Engineering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.3 Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK
(Benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . . . . 12
5.4 Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN
(Benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . . . . 12
5.5 Classification Report and Confusion Matrix of KNN (K=2) using Borderline
SMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering) . . . 12
5.6 Classification Report and Confusion Matrix of SVM RBF(K=5) using SVMSMOTE+Random
Undersampling (Benchmark Dataset with Feature Engineering) . . . . . . . . . 12
5.7 Classification Report and Confusion Matrix of SVM Linear using SMOTE+TOMEK
(Benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . . . 12
5.8 Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN
(Benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . . . 13
5.9 Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1)
using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset
without Feature Engineering) . . . 13
5.10 Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1)
using SVMSMOTE+Random Undersampling (Non-benchmark Dataset without
Feature Engineering) . . . 13
5.11 Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK
(Non-benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . 13
5.12 Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN
(Non-benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . 13
5.13 Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=3)
using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset
with Feature Engineering) . . . 14
5.14 Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=1)
using SVMSMOTE+Random Undersampling (Non-benchmark Dataset with
Feature Engineering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.15 Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK
(Non-benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . 14
5.16 Classification Report and Confusion Matrix of SVM Linear using SMOTE+ENN
(Non-benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . 14
List of Tables
5.1 Recall values for benchmark dataset without feature engineering . . . . . . . . 11
5.2 Recall values for benchmark dataset with feature engineering . . . . . . . . . . 11
5.3 Recall values for non-benchmark dataset without feature engineering . . . . . 11
5.4 Recall values for non-benchmark dataset with feature engineering . . . . . . . 11
6.1 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 1
Introduction
A stroke is a brain attack that cuts off crucial blood flow and oxygen to the brain. A stroke
arises when a blood vessel feeding the brain gets clogged or bursts. Identifying a stroke and
taking medical action immediately can not only lengthen life but also help to prevent heart
disease in the future.
Machine learning has become one of the most demanding fields in modern technology. Stroke
prediction in adults can be done using various machine learning algorithms. It has
become a captivating research problem, as there are various parameters that can affect the
outcome. These parameters include work type, glucose level, body mass index, gender, residence
type, age, smoking status of the individual and any previous heart disease.
The proposed models predict the likelihood of stroke for individuals using various machine
learning algorithms like K-Nearest Neighbors, Decision Tree, Support Vector Machine,
and Gaussian Naive Bayes, based on the features mentioned above, which are taken
from the dataset on which each model has been trained.
Chapter 2
Literature Reviews
In [1], various classification algorithms are studied and the most accurate model for
predicting stroke in patients is identified. It was found that Decision Tree and SVM were the
most efficient algorithms, while KNN was found to be the most ineffective one.
In [2], different oversampling techniques for imbalanced datasets are discussed. This survey
also identifies some of the major flaws in stroke-related studies, so that suitable
solutions can be proposed in order to overcome the disease.
Chapter 3
Data Collection & Processing
3.1 Dataset
The datasets have been taken from the Kaggle website (both a benchmark and a non-benchmark
dataset). Fig 3.1 shows the benchmark dataset, which has 5110 rows and 11 columns. The features
consist of gender, age, work type, residence type, average glucose level, body mass index
(BMI), hypertension, heart disease, ever married and smoking status. The target column is
stroke. On the other hand, Fig 3.2 shows the non-benchmark dataset, which has 31652 rows
and 11 columns. Its attributes are the same as those of the benchmark dataset, and the target
column is again stroke. For both datasets, all features except the id have been used for
training the models.
Figure 3.1: Benchmark Dataset
Figure 3.2: Non-benchmark Dataset
3.2 Dataset Pre-processing
The benchmark dataset contains 201 null values in the BMI attribute, which are replaced with
the mean value of BMI. A label encoding technique is used to convert the labels of the
gender, ever married, work type, residence type and smoking status features into numeric
values so that the machine can process them. The average glucose level and BMI
attributes are normalized, which changes the values of these numeric columns to a common
scale between 0 and 1. Both datasets are then split into training and testing sets in an
80:20 ratio.
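As an illustration, these steps might look like the following minimal Python sketch using pandas and scikit-learn; the file name and column names (e.g. bmi, avg_glucose_level) are assumptions based on the Kaggle benchmark dataset, not values taken from this report.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # assumed file name
df = df.drop(columns=["id"])                            # id is not used for training

# Replace the null BMI values with the mean BMI
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# Label-encode the categorical features (assumed column names)
for col in ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Normalize average glucose level and BMI to a common [0, 1] scale
df[["avg_glucose_level", "bmi"]] = MinMaxScaler().fit_transform(
    df[["avg_glucose_level", "bmi"]])

# 80:20 train/test split
X, y = df.drop(columns=["stroke"]), df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```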
3.3 Feature Engineering in Train Data
Feature engineering refers to the use of domain knowledge to select and transform the most
suitable features from raw data when building a predictive model with machine learning.
3.3.1 Handling Imbalance Data
3.3.1.1 Borderline SMOTE+Random Undersampling
The Borderline SMOTE algorithm is used to oversample the minority class, combined with the
random undersampling algorithm, as a hybrid technique for balancing the imbalanced data.
Borderline SMOTE starts by classifying the minority class observations, using the KNN
algorithm for this classification. It classifies a minority observation as a noise point if
all of that point's neighbors belong to the majority class, and such an observation is
ignored while creating synthetic data. It classifies points that have both majority and
minority class neighbors as border points, and resamples exclusively from these points.
Random undersampling works by randomly selecting observations from the majority class
and deleting them from the training dataset.
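A minimal sketch of this hybrid step with the imbalanced-learn library, reusing X_train and y_train from the pre-processing sketch above, might look as follows; the sampling ratios are illustrative assumptions, not values from the report.

```python
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority (stroke) class near the class border...
over = BorderlineSMOTE(sampling_strategy=0.5, random_state=42)      # assumed ratio
X_over, y_over = over.fit_resample(X_train, y_train)

# ...then randomly delete majority-class observations.
under = RandomUnderSampler(sampling_strategy=1.0, random_state=42)  # assumed ratio
X_bal, y_bal = under.fit_resample(X_over, y_over)
```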
3.3.1.2 SVMSMOTE+Random Undersampling
The SVMSMOTE algorithm is used to oversample the minority class, again combined with the
random undersampling algorithm as a hybrid technique for balancing the imbalanced data.
SVMSMOTE is a variant of Borderline SMOTE that uses the SVM algorithm instead of KNN to
classify the minority class observations. Apart from this difference, the algorithm works
the same way as the Borderline SMOTE algorithm described in the previous section. Random
undersampling works by randomly selecting observations from the majority class and deleting
them from the training dataset.
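The only change from the previous sketch is the oversampler; again the ratios are assumptions.

```python
from imblearn.over_sampling import SVMSMOTE
from imblearn.under_sampling import RandomUnderSampler

# SVMSMOTE fits an SVM to locate borderline minority samples to synthesize from.
over = SVMSMOTE(sampling_strategy=0.5, random_state=42)             # assumed ratio
under = RandomUnderSampler(sampling_strategy=1.0, random_state=42)  # assumed ratio

X_over, y_over = over.fit_resample(X_train, y_train)
X_bal, y_bal = under.fit_resample(X_over, y_over)
```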
3.3.1.3 SMOTE+Tomek
SMOTE+TOMEK is a hybrid technique that aims to clean overlapping data points between the
classes in the sample space. Tomek links are identified among the minority class samples
after they have been oversampled by SMOTE. As a result, rather than discarding observations
only from the majority class, observations of both classes that form Tomek links are eliminated.
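imbalanced-learn ships this combination as a single resampler; a minimal sketch:

```python
from imblearn.combine import SMOTETomek

# Oversample with SMOTE, then drop every pair of samples that forms a Tomek link.
X_bal, y_bal = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
```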
3.3.1.4 SMOTE+ENN
SMOTE+ENN is another hybrid technique, in which a larger number of observations is removed
from the sample space. ENN (Edited Nearest Neighbours) is an undersampling technique in
which the nearest neighbors of each majority class instance are examined; if the neighbors
misclassify that specific instance of the majority class, the instance is eliminated.
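The corresponding combined resampler, as a minimal sketch:

```python
from imblearn.combine import SMOTEENN

# Oversample with SMOTE, then clean the result with Edited Nearest Neighbours.
X_bal, y_bal = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
```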
3.3.2 Feature Engineering
Feature engineering refers to the procedure of using domain knowledge to choose and
transform the most relevant features when establishing a predictive model with a machine
learning model. Its aim is to improve the performance of machine learning (ML) algorithms.
3.3.2.1 Pearson Correlation
Pearson’s correlation coefficient measures how strong the linear association between
features is.
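For instance, the correlation of each (already label-encoded, hence numeric) feature with the target could be inspected as follows; a low absolute value marks a candidate for removal.

```python
# Pearson correlation of every feature with the target column.
corr_with_target = df.corr(method="pearson")["stroke"].drop("stroke")
print(corr_with_target.abs().sort_values(ascending=False))
```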
3.3.2.2 Univariate Process
Univariate feature selection works by choosing the best features based on univariate
statistical tests. It can be seen as a preprocessing step for an estimator.
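A sketch with scikit-learn's SelectKBest; the score function and k=8 are illustrative choices, not settings reported here.

```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=8)  # assumed k
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)            # reuse the fitted selector
```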
3.3.2.3 Extra Tree Classifier
Extremely Randomized Trees Classifier is a type of ensemble learning technique that
aggregates the results of multiple de-correlated decision trees, collected in a “forest”,
to produce its classification result. The Extra Tree Classifier is very similar to the
Random Forest Classifier and differs from it only in the way the decision trees in the
forest are constructed.
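Feature ranking with an Extra Trees ensemble might be sketched as follows; the forest size is an assumption.

```python
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=100, random_state=42)  # assumed size
forest.fit(X_train, y_train)

# Impurity-based importance of each feature, highest first.
for name, score in sorted(zip(X_train.columns, forest.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```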
3.4 Classification
3.4.1 K-Nearest Neighbor
K-Nearest Neighbor is an elementary algorithm that stores all the available cases and
classifies new data based on a similarity measure. ‘K’ denotes the number of nearest
neighbors that vote on the class of the new data point. To find the k least distant points,
distance measures such as Euclidean distance and Manhattan distance are used. It is also
called a lazy learner because it does not derive a discriminative function from the training
data; it simply memorizes the training data, and there is no explicit learning phase in the model.
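A minimal sketch; K=5 with Euclidean distance matches the setting shown for the benchmark dataset in Fig 5.1, and X_bal, y_bal are the balanced training data from Section 3.3.1.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_bal, y_bal)        # trained on the balanced training data
y_pred = knn.predict(X_test)
```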
3.4.2 Decision Tree
It is a tree-shaped diagram used to evaluate a course of action, where each branch of the
tree represents a possible decision. It is used for both classification and regression:
classification is applied to discrete values, while regression is applied to continuous
values. A classification tree derives a set of logical if-then conditions to classify
problems, while a regression tree is employed when the target value is numerical or
continuous in nature.
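A minimal classification-tree sketch; the depth limit is an illustrative choice, not a reported setting.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)  # assumed depth
tree.fit(X_bal, y_bal)
y_pred = tree.predict(X_test)
```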
3.4.3 Support Vector Machine (SVM)
Support Vector Machine is a supervised learning method in which the model learns from past
data and produces future predictions as output. It trains on the labeled sample data to
find the decision boundary that separates the classes; new unlabeled data is then plotted
relative to this boundary and its value is predicted.
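A sketch comparing the three kernels used in Chapter 5; default hyperparameters are assumed.

```python
from sklearn.svm import SVC

for kernel in ("linear", "poly", "rbf"):
    svm = SVC(kernel=kernel, random_state=42)
    svm.fit(X_bal, y_bal)
    print(kernel, svm.score(X_test, y_test))  # mean accuracy on the test split
```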
3.4.4 Gaussian Naive Bayes
Gaussian Naive Bayes is a variant of Naive Bayes that follows a Gaussian (normal)
distribution and supports continuous data. When working with continuous data, the
assumption is made that the continuous values associated with each class are distributed
according to a Gaussian distribution. It handles continuous-valued features by modeling
each one as following a class-conditional Gaussian distribution.
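A minimal sketch:

```python
from sklearn.naive_bayes import GaussianNB

# Each feature gets a per-class Gaussian likelihood estimated from the training data.
gnb = GaussianNB()
gnb.fit(X_bal, y_bal)
y_pred = gnb.predict(X_test)
```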
Chapter 4
Methodology
In this paper, different models are introduced to predict whether an individual will have a
stroke based on several features like age, gender, hypertension, heart disease, ever
married, smoking status, work type, etc. The benchmark and the non-benchmark datasets are
trained on various machine learning algorithms, and their performance is inspected to find
out which one predicts stroke most efficiently. Fig 4.1 shows the flowchart of the proposed
model. Firstly, data collection is performed, followed by data pre-processing steps to
obtain a cleaned dataset without null or duplicate values for better training and higher
accuracy. After that, the dataset is split into training and testing data and fed into the
different classification models to obtain the predictions. The confusion matrix and the
performance metrics of each model are then obtained to determine the most efficient
algorithm for the prediction; see the sketch after Fig 4.1.
Figure 4.1: Flow Chart of Proposed Model
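Read as code, the flow of Fig 4.1 might be sketched as an imbalanced-learn Pipeline; the particular resampler and classifier chosen here are illustrative, not the single configuration of the report.

```python
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

model = Pipeline([
    ("balance", SMOTEENN(random_state=42)),        # resamples only the training data
    ("clf", SVC(kernel="poly", random_state=42)),  # any Chapter 3 classifier fits here
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```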
Chapter 5
Experiments and Results
The results obtained after applying Decision Tree, KNN, Gaussian Naive Bayes and SVM
to both datasets are shown in this section. The metrics used to carry out the performance
analysis of the algorithms are accuracy, precision (P), recall (R) and F1-score. These
performance metrics are derived from the confusion matrix, which is used to summarize the
overall performance of each model.
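With TP, TN, FP and FN denoting the true/false positive and negative counts from the confusion matrix, these metrics are defined as:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
\]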
After that, Table 5.1, Table 5.2, Table 5.3 and Table 5.4 show the recall values (for both
the benchmark and the non-benchmark dataset) obtained from each of the machine learning
models under the four hybrid data balancing techniques, both with and without feature
engineering. It can be observed from the tables that the highest recall value is obtained
from K-Nearest Neighbor and the worst recall value is obtained from Decision Tree.
Table 5.1 shows the recall values for the benchmark dataset obtained from each of the
machine learning models under the four hybrid data balancing techniques without feature
engineering, and Table 5.2 shows the corresponding values with feature engineering.
Moreover, Table 5.3 shows the recall values for the non-benchmark dataset without feature
engineering, and Table 5.4 shows them with feature engineering.
Table 5.1: Recall values for benchmark dataset without feature engineering
(each cell shows recall for class 0 / class 1)
Technique | Decision Tree | Gaussian Naive Bayes | SVM (Linear) | SVM (Poly) | SVM (RBF) | K-Nearest Neighbor
Borderline SMOTE+Random Undersampling | 0.82/0.56 | 0.75/0.81 | 0.80/0.75 | 0.85/0.65 | 0.83/0.63 | 0.82/0.75
SVMSMOTE+Random Undersampling | 0.79/0.63 | 0.75/0.79 | 0.79/0.77 | 0.85/0.65 | 0.82/0.69 | 0.78/0.79
SMOTE+Tomek | 0.87/0.35 | 0.61/0.83 | 0.72/0.71 | 0.80/0.75 | 0.85/0.50 | 0.78/0.67
SMOTE+ENN | 0.86/0.50 | 0.65/0.87 | 0.69/0.85 | 0.72/0.83 | 0.78/0.65 | 0.76/0.73
Table 5.2: Recall values for benchmark dataset with feature engineering
(each cell shows recall for class 0 / class 1)
Technique | Decision Tree | Gaussian Naive Bayes | SVM (Linear) | SVM (Poly) | SVM (RBF) | K-Nearest Neighbor
Borderline SMOTE+Random Undersampling | 0.56/0.80 | 0.74/0.81 | 0.79/0.73 | 0.85/0.65 | 0.81/0.77 | 0.81/0.79
SVMSMOTE+Random Undersampling | 0.77/0.65 | 0.77/0.77 | 0.80/0.73 | 0.85/0.65 | 0.76/0.79 | 0.79/0.77
SMOTE+Tomek | 0.87/0.37 | 0.60/0.83 | 0.71/0.79 | 0.79/0.69 | 0.80/0.62 | 0.81/0.60
SMOTE+ENN | 0.87/0.38 | 0.64/0.85 | 0.69/0.83 | 0.74/0.81 | 0.79/0.69 | 0.79/0.73
Table 5.3: Recall values for non-benchmark dataset without feature engineering
(each cell shows recall for class 0 / class 1)
Technique | Decision Tree | Gaussian Naive Bayes | SVM (Linear) | SVM (Poly) | SVM (RBF) | K-Nearest Neighbor
Borderline SMOTE+Random Undersampling | 0.87/0.40 | 0.75/0.76 | 0.81/0.66 | 0.84/0.60 | 0.87/0.44 | 0.87/0.47
SVMSMOTE+Random Undersampling | 0.84/0.48 | 0.77/0.76 | 0.80/0.68 | 0.84/0.60 | 0.85/0.54 | 0.85/0.50
SMOTE+Tomek | 0.92/0.21 | 0.61/0.81 | 0.78/0.67 | 0.78/0.71 | 0.78/0.48 | 0.85/0.37
SMOTE+ENN | 0.90/0.21 | 0.63/0.79 | 0.74/0.73 | 0.76/0.73 | 0.78/0.47 | 0.82/0.45
Table 5.4: Recall values for non-benchmark dataset with feature engineering
(each cell shows recall for class 0 / class 1)
Technique | Decision Tree | Gaussian Naive Bayes | SVM (Linear) | SVM (Poly) | SVM (RBF) | K-Nearest Neighbor
Borderline SMOTE+Random Undersampling | 0.86/0.43 | 0.76/0.74 | 0.81/0.67 | 0.84/0.61 | 0.87/0.51 | 0.87/0.44
SVMSMOTE+Random Undersampling | 0.81/0.47 | 0.76/0.75 | 0.80/0.69 | 0.84/0.63 | 0.84/0.59 | 0.84/0.53
SMOTE+Tomek | 0.91/0.20 | 0.61/0.81 | 0.77/0.70 | 0.77/0.70 | 0.75/0.60 | 0.86/0.30
SMOTE+ENN | 0.89/0.26 | 0.62/0.80 | 0.74/0.76 | 0.75/0.74 | 0.75/0.58 | 0.83/0.38
Figure 5.1: Classification Report and Confusion Matrix of KNN (K=5) using Borderline
SMOTE+Random Undersampling (Benchmark Dataset without Feature Engineering)
Figure 5.2: Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=8) using
SVMSMOTE+Random Undersampling (Benchmark Dataset without Feature Engineering)
Figure 5.3: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK
(Benchmark Dataset without Feature Engineering)
Figure 5.4: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN
(Benchmark Dataset without Feature Engineering)
Figure 5.5: Classification Report and Confusion Matrix of KNN (K=2) using Borderline
SMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering)
Figure 5.6: Classification Report and Confusion Matrix of SVM RBF(K=5) using
SVMSMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering)
Figure 5.7: Classification Report and Confusion Matrix of SVM Linear using
SMOTE+TOMEK (Benchmark Dataset with Feature Engineering)
Figure 5.8: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN
(Benchmark Dataset with Feature Engineering)
Figure 5.9: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1)
using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset without Feature
Engineering)
Figure 5.10: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1)
using SVMSMOTE+Random Undersampling (Non-benchmark Dataset without Feature Engineering)
Figure 5.11: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK
(Non-benchmark Dataset without Feature Engineering)
Figure 5.12: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN
(Non-benchmark Dataset without Feature Engineering)
Figure 5.13: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=3)
using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset with Feature
Engineering)
Figure 5.14: Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=1)
using SVMSMOTE+Random Undersampling (Non-benchmark Dataset with Feature Engi-
neering)
Figure 5.15: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK
(Non-benchmark Dataset with Feature Engineering)
Figure 5.16: Classification Report and Confusion Matrix of SVM Linear using SMOTE+ENN
(Non-benchmark Dataset with Feature Engineering)
Chapter 6
Future Work and Conclusion
In this paper, Decision Tree has given the worst results in all cases. Gaussian Naive Bayes
has given a good recall value for label 1 (stroke) in all cases, but most of the time it has
given a lower recall value for label 0 (no stroke) than for label 1 (stroke), and it has
also yielded lower accuracy.
Generally, KNN and SVM have given good results compared to all the other models, as shown
in Table 6.1; SVM with the Poly kernel in particular has performed well. For the
non-benchmark dataset, which is a large dataset, SMOTE+TOMEK and SMOTE+ENN, which are
hybrid resampling techniques, have given better performance than Borderline SMOTE and
SVMSMOTE. Even Gaussian Naive Bayes has given a good recall value for label 1 (stroke)
under Borderline SMOTE and SVMSMOTE, good enough to compete with SMOTE+TOMEK and
SMOTE+ENN, but, as mentioned earlier, it does not give a good recall value for label 0
(no stroke) most of the time and also yields low accuracy. So, overall, SMOTE+TOMEK and
SMOTE+ENN have performed better for the non-benchmark dataset.
In this paper, a survey of different machine learning algorithms has been presented. In the
near future, this work will be extended by applying deep learning algorithms as well.
Table 6.1: Result
(best-performing model per balancing technique and dataset configuration; FE = feature engineering)
Technique | Benchmark (without FE) | Benchmark (with FE) | Non-benchmark (without FE) | Non-benchmark (with FE)
Borderline SMOTE+Random Undersampling | KNN (K=5) | KNN (K=2) | Gaussian Naive Bayes (K=1) | Gaussian Naive Bayes (K=3)
SVMSMOTE+Random Undersampling | Gaussian Naive Bayes (K=8) | SVM RBF (K=5) | Gaussian Naive Bayes (K=1) | Gaussian Naive Bayes (K=1)
SMOTE+TOMEK | SVM Poly | SVM Linear | SVM Poly | SVM Poly
SMOTE+ENN | SVM Poly | SVM Poly | SVM Poly | SVM Linear
References
[1] T. Rakshit and A. Shrestha, “Comparative analysis and implementation of heart stroke
prediction using various machine learning techniques.”
[2] H. Shashank, S. Srikanth, A. Thejas, et al., Prediction of Stroke Using Machine Learning.
PhD thesis, CMR Institute of Technology, Bangalore, 2020.
Generated using Undergraduate Thesis LaTeX Template, Version 1.4. Department of Computer
Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh.
This project report was generated on Wednesday 29th September, 2021 at 4:11pm.
17

More Related Content

What's hot

final.pptx
final.pptxfinal.pptx
final.pptxyogha8
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Kedar Damkondwar
 
BRAIN STROKE REVIEW.pptx
BRAIN STROKE REVIEW.pptxBRAIN STROKE REVIEW.pptx
BRAIN STROKE REVIEW.pptxyogha8
 
Cardiovascular Disease Prediction Using Machine Learning Approaches.pptx
Cardiovascular Disease Prediction Using Machine Learning Approaches.pptxCardiovascular Disease Prediction Using Machine Learning Approaches.pptx
Cardiovascular Disease Prediction Using Machine Learning Approaches.pptxTaminul Islam
 
Heart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning AlgorithmHeart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning Algorithmijtsrd
 
Heart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine LearningHeart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine Learningmohdshoaibuddin1
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data miningEr. Nawaraj Bhandari
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MININGAshish Salve
 
HPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine LearningHPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine LearningNimai Chand Das Adhikari
 
INTRODUCTION TO MACHINE LEARNING.pptx
INTRODUCTION TO MACHINE LEARNING.pptxINTRODUCTION TO MACHINE LEARNING.pptx
INTRODUCTION TO MACHINE LEARNING.pptxAbhigyanMishra17
 
Stochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxStochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxShubham Jaybhaye
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)Learnbay Datascience
 
Heart disease prediction
Heart disease predictionHeart disease prediction
Heart disease predictionAriful Haque
 
Spam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxSpam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxKunal Kalamkar
 

What's hot (20)

final.pptx
final.pptxfinal.pptx
final.pptx
 
Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm Heart disease prediction using machine learning algorithm
Heart disease prediction using machine learning algorithm
 
BRAIN STROKE REVIEW.pptx
BRAIN STROKE REVIEW.pptxBRAIN STROKE REVIEW.pptx
BRAIN STROKE REVIEW.pptx
 
Final ppt
Final pptFinal ppt
Final ppt
 
Cardiovascular Disease Prediction Using Machine Learning Approaches.pptx
Cardiovascular Disease Prediction Using Machine Learning Approaches.pptxCardiovascular Disease Prediction Using Machine Learning Approaches.pptx
Cardiovascular Disease Prediction Using Machine Learning Approaches.pptx
 
Heart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning AlgorithmHeart Disease Prediction using Machine Learning Algorithm
Heart Disease Prediction using Machine Learning Algorithm
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM )
 
Heart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine LearningHeart Attack Prediction using Machine Learning
Heart Attack Prediction using Machine Learning
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
 
HPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine LearningHPPS: Heart Problem Prediction System using Machine Learning
HPPS: Heart Problem Prediction System using Machine Learning
 
INTRODUCTION TO MACHINE LEARNING.pptx
INTRODUCTION TO MACHINE LEARNING.pptxINTRODUCTION TO MACHINE LEARNING.pptx
INTRODUCTION TO MACHINE LEARNING.pptx
 
Stochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxStochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptx
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)
 
Heart disease prediction
Heart disease predictionHeart disease prediction
Heart disease prediction
 
Spam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxSpam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptx
 

Similar to A Survey on Stroke Prediction

Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network HamdaAnees
 
Rapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionRapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionMatthieu Cisel
 
A Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsSandra Long
 
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable ComputingBig Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable ComputingGabriela Agustini
 
Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16Sebastian
 
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingMaster Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingAndrea Tino
 
Pawar-Ajinkya-MASc-MECH-December-2016
Pawar-Ajinkya-MASc-MECH-December-2016Pawar-Ajinkya-MASc-MECH-December-2016
Pawar-Ajinkya-MASc-MECH-December-2016Ajinkya Pawar
 
project Report on LAN Security Manager
project Report on LAN Security Managerproject Report on LAN Security Manager
project Report on LAN Security ManagerShahrikh Khan
 
Distributed Traffic management framework
Distributed Traffic management frameworkDistributed Traffic management framework
Distributed Traffic management frameworkSaurabh Nambiar
 
Machine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_ThesisMachine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_ThesisBryan Collazo Santiago
 
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmIntegrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmKavita Pillai
 
Mth201 COMPLETE BOOK
Mth201 COMPLETE BOOKMth201 COMPLETE BOOK
Mth201 COMPLETE BOOKmusadoto
 
A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...
A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...
A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...Man_Ebook
 

Similar to A Survey on Stroke Prediction (20)

Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
Rapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality ReductionRapport d'analyse Dimensionality Reduction
Rapport d'analyse Dimensionality Reduction
 
A Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency AlgorithmsA Comparative Study Of Generalized Arc-Consistency Algorithms
A Comparative Study Of Generalized Arc-Consistency Algorithms
 
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable ComputingBig Data and the Web: Algorithms for Data Intensive Scalable Computing
Big Data and the Web: Algorithms for Data Intensive Scalable Computing
 
Big data-and-the-web
Big data-and-the-webBig data-and-the-web
Big data-and-the-web
 
Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16Thesis_Sebastian_Ånerud_2015-06-16
Thesis_Sebastian_Ånerud_2015-06-16
 
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingMaster Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
 
Pawar-Ajinkya-MASc-MECH-December-2016
Pawar-Ajinkya-MASc-MECH-December-2016Pawar-Ajinkya-MASc-MECH-December-2016
Pawar-Ajinkya-MASc-MECH-December-2016
 
xlelke00
xlelke00xlelke00
xlelke00
 
project Report on LAN Security Manager
project Report on LAN Security Managerproject Report on LAN Security Manager
project Report on LAN Security Manager
 
Distributed Traffic management framework
Distributed Traffic management frameworkDistributed Traffic management framework
Distributed Traffic management framework
 
AnthonyPioli-Thesis
AnthonyPioli-ThesisAnthonyPioli-Thesis
AnthonyPioli-Thesis
 
Sarda_uta_2502M_12076
Sarda_uta_2502M_12076Sarda_uta_2502M_12076
Sarda_uta_2502M_12076
 
Machine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_ThesisMachine_Learning_Blocks___Bryan_Thesis
Machine_Learning_Blocks___Bryan_Thesis
 
T401
T401T401
T401
 
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmIntegrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
 
Tutorial
TutorialTutorial
Tutorial
 
Mth201 COMPLETE BOOK
Mth201 COMPLETE BOOKMth201 COMPLETE BOOK
Mth201 COMPLETE BOOK
 
outiar.pdf
outiar.pdfoutiar.pdf
outiar.pdf
 
A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...
A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...
A study on improving speaker diarization system = Nghiên cứu phương pháp cải ...
 

Recently uploaded

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 

Recently uploaded (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 

A Survey on Stroke Prediction

  • 1. A Survey on Stroke Prediction Drubojit Saha 170104027 Mohammad Rakib-Uz-Zaman 170104041 Tasnim Nusrat Hasan 170104046 Project Report Course ID: CSE 4214 Course Name: Pattern Recognition Lab Semester: Fall 2020 . Department of Computer Science and Engineering Ahsanullah University of Science and Technology Dhaka, Bangladesh 30 September 2021
  • 2. A Survey on Stroke Prediction Submitted by Drubojit Saha 170104027 Mohammad Rakib-Uz-Zaman 170104041 Tasnim Nusrat Hasan 170104046 Submitted To Faisal Muhammad Shah, Associate Professor Farzad Ahmed, Lecturer Md. Tanvir Rouf Shawon, Lecturer Department of Computer Science and Engineering Ahsanullah University of Science and Technology . Department of Computer Science and Engineering Ahsanullah University of Science and Technology Dhaka, Bangladesh 30 September 2021
  • 3. ABSTRACT Strokes have rapidly raised globally even at not only in older ages but also in juve- nile ages. Stroke prediction is found to be a compound task which requires enormous amount of data pre-processing also there is a need to automatize the prophecy process for the early exposure of symptoms related to stroke so that it can be averted at an early stage. In this research, stroke prediction is observed on two dataset collected from Kag- gle (Both benchmark and non Benchmark dataset) in the proposed models. Different models anticipate the chances a person will have stroke based on symptoms like age, gender, average glucose level, smoking status, body mass index, work type and resi- dence type. Each model analyzes the person’s risk level by performing various machine learning algorithms like Gaussian Naive Bayes, K-Nearest Neighbor (KNN), Decision Tree and Support Vector Machine (SVM). Hence, a provisional study is shown between the different algorithms and the most competent one is obtained. i
  • 4. Contents ABSTRACT i List of Figures iv List of Tables vi 1 Introduction 1 2 Literature Reviews 2 3 Data Collection & Processing 3 3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Dataset Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3 Feature Engineering in Train Data . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3.1 Handling Imbalance Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3.1.1 Borderline SMOTE+Random Undersampling . . . . . . . . . . 4 3.3.1.2 SVMSMOTE+Random Undersampling . . . . . . . . . . . . . . 5 3.3.1.3 SMOTE+Tomek . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.3.1.4 SMOTE+ENN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.3.2 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.3.2.1 Pearson Corelation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3.2.2 Univariate Process . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3.2.3 Extra Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.4.1 K-Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.4.2 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.4.3 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . . . 7 3.4.4 Gaussian Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4 Methodology 8 5 Experiments and Results 10 6 Future Work and Conclusion 15 ii
  • 6. List of Figures 3.1 Benchmark Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Non-benchmark Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4.1 Flow Chart of Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.1 Classification Report and Confusion Matrix of KNN(K=5) using Borderline SMOTE+Random Oversampling (Benchmark Dataset without Feature Engi- neering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.2 Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=8) us- ing SVMSMOTE+Random Undersampling (Benchmark Dataset without Fea- ture Engineering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5.3 Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . . . . 12 5.4 Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . . . . 12 5.5 Classification Report and Confusion Matrix of KNN(K=2) using Borderline SMOTE+Random Oversampling (Benchmark Dataset with Feature Engineer- ing) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.6 Classification Report and Confusion Matrix of SVM RBF(K=5) using SVMSMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering) . . . . . . . . . 12 5.7 Classification Report and Confusion Matrix of SVM Linear using SMOTE+TOMEK (Benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . . . 12 5.8 Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . . . 13 5.9 Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=1) using Borderline SMOTE+Random Oversampling (Non-benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.10 Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=1) using SVMSMOTE+Random Undersampling (Non-enchmark Dataset with- out Feature Engineering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.11 Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Non-benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . 13 iv
  • 7. 5.12 Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Non-benchmark Dataset without Feature Engineering) . . . . . . . . . . . . . 13 5.13 Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=3) using Borderline SMOTE+Random Oversampling (Non-benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5.14 Classification Report and Confusion Matrix of Gaussian Naive Bayes(K=1) using SVMSMOTE+Random Undersampling (Non-benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5.15 Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Non-benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . 14 5.16 Classification Report and Confusion Matrix of SVM Linear using SMOTE+ENN (Non-benchmark Dataset with Feature Engineering) . . . . . . . . . . . . . . . . 14 v
  • 8. List of Tables 5.1 Recall values for benchmark dataset without feature engineering . . . . . . . . 11 5.2 Recall values for benchmark dataset with feature engineering . . . . . . . . . . 11 5.3 Recall values for non-benchmark dataset without feature engineering . . . . . 11 5.4 Recall values for non-benchmark dataset with feature engineering . . . . . . . 11 6.1 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 vi
  • 9. 1 Chapter 1 Introduction A stroke is a brain attack, cutting off crucial blood flow and oxygen to the brain. Stroke arises when a blood artery feeding the brain gets clogged or bursts. Identifying stroke and taking medical action immediately can not only lengthen life but also help to prevent heart disease in the future. Machine learning has become one of the most exacting field in modern technology. Stroke prediction in grownups can be done by using various machine learning algorithms. It has become a captivated research problem as there are various parameters that can effect the outcome. The causes consist of work type, glucose level, body mass index, gender, residence type, age, average smoking status of the individual and any previous heart disease. The proposed models foresee stroke prediction of several individuals using various machine learning algorithms like K-Nearest Neighbors, Decision Tree Classifier, Support Vector Ma- chine, and Gaussian Naive Bayes based on features mentioned above which have been taken from the dataset on which the model has been trained.
  • 10. 2 Chapter 2 Literature Reviews In [1] various classification algorithms are studied and the most accurate model is obtained for predicting the stroke in the patient. It was found that Decision Tree and SVM were the most efficient algorithms while KNN was found to be the most ineffective one. In [2] different oversampling techniques are mentioned for imbalance dataset. This survey work also finds out some of the major flaws associated with stroke related topics, so that a suitable solution can be proposed in order to overcome the disease.
  • 11. 3 Chapter 3 Data Collection & Processing 3.1 Dataset The datasets have been taken from the Kaggle website (Both Benchmark and non Benchmark Dataset). Fig 3.1 shows the benchmark dataset has 5110 rows and 11 columns. The features are consist of gender, age, work type, residence type, average glucose level, body mass index (BMI), hypertension, heart disease, ever married, smoking status. The target column is stroke. On the other hand, Fig 3.2 shows the non benchmark dataset that has 31652 rows and 11 columns. The attributes are same as the benchmark dataset. The target column is stroke. All the other features for both the dataset have been used for training the model except the id. Figure 3.1: Benchmark Dataset
  • 12. 3.2. DATASET PRE-PROCESSING 4 Figure 3.2: Non-benchmark Dataset 3.2 Dataset Pre-processing The benchmark dataset contains 201 null value in BMI attribute which are replaced by calculating the mean value of BMI. Lebel encoding technique is used for converting the gender, ever married, work type, resident type, smoking status features label’s into a numeric value so that the machine can read the values in a machine-readable form. The attribute named average glucose level and BMI data are normalized which means it chanes the values of numeric columns to a common scale between 0 to 1. Then the both the dataset are spillted into 80 to 20 ratio. 3.3 Feature Engineering in Train Data Feature engineering indicates to the transformation of using domain knowledge to select and convert the most compatible features from raw data when making a predictive model using machine learning. 3.3.1 Handling Imbalance Data 3.3.1.1 Borderline SMOTE+Random Undersampling Borderline SMOTE algorithm is used for oversampling from minority class with undersam- pling algorithm Random undersampling as hybrid technique for Balancing the imbalance data. Borderline SMOTE algorithm starts by classifying the minority class observations. It uses KNN algorithm for this classification. It classifies any minority observation as a noise point if all the neighbors of that point are the majority class and such an observation is ig-
3.3 Feature Engineering in Train Data

Feature engineering refers to using domain knowledge to select and transform the most suitable features from raw data when building a predictive model with machine learning.

3.3.1 Handling Imbalanced Data

3.3.1.1 Borderline SMOTE + Random Undersampling

The Borderline SMOTE algorithm is used to oversample the minority class, combined with random undersampling of the majority class, as a hybrid technique for balancing the imbalanced data. Borderline SMOTE starts by classifying the minority-class observations using KNN. It marks a minority observation as a noise point if all of its neighbors belong to the majority class; such observations are ignored when creating synthetic data. It marks as border points those observations that have both majority- and minority-class neighbors, and it resamples exclusively from these points. Random undersampling then works by randomly selecting observations from the majority class and deleting them from the training dataset.

3.3.1.2 SVMSMOTE + Random Undersampling

The SVMSMOTE algorithm is used to oversample the minority class, again combined with random undersampling as a hybrid balancing technique. SVMSMOTE is an alternative to Borderline SMOTE: it uses an SVM instead of KNN to classify the minority-class observations. Apart from this difference, the algorithm works the same way as Borderline SMOTE described above.

3.3.1.3 SMOTE+Tomek

SMOTE+Tomek is a hybrid technique that aims to clean overlapping data points between the classes in the sample space. Tomek links are applied to the minority-class samples oversampled by SMOTE. As a result, rather than discarding observations only from the majority class, it eliminates observations of both classes that form Tomek links.

3.3.1.4 SMOTE+ENN

SMOTE+ENN is another hybrid technique, in which a larger number of observations is removed from the sample space. ENN is an undersampling technique that inspects the nearest neighbors of each majority-class observation; if the nearest neighbors misclassify that specific instance, the instance is eliminated.
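Continuing from the split above, the four hybrid balancing schemes can be sketched with imbalanced-learn as follows; this is a plausible implementation under default parameters, not the authors' exact code.

```python
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek, SMOTEENN

def borderline_smote_rus(X, y):
    # Oversample the minority class around the class border,
    # then randomly delete majority-class observations.
    X_os, y_os = BorderlineSMOTE(random_state=42).fit_resample(X, y)
    return RandomUnderSampler(random_state=42).fit_resample(X_os, y_os)

def svmsmote_rus(X, y):
    # Same scheme, but the borderline region is found with an SVM.
    X_os, y_os = SVMSMOTE(random_state=42).fit_resample(X, y)
    return RandomUnderSampler(random_state=42).fit_resample(X_os, y_os)

X_bl, y_bl = borderline_smote_rus(X_train, y_train)

# SMOTE followed by Tomek-link removal, and SMOTE followed by ENN cleaning
X_tomek, y_tomek = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
```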
3.3.2 Feature Engineering

Feature engineering refers to the procedure of using domain knowledge to choose and transform the most relevant features when establishing a predictive model. Its aim is to improve the performance of the machine learning (ML) algorithms.

3.3.2.1 Pearson Correlation

Pearson's correlation coefficient measures how strong the linear association between features is.

3.3.2.2 Univariate Process

Univariate feature selection works by choosing the best features based on univariate statistical tests. It can be seen as a preprocessing step before fitting an estimator.

3.3.2.3 Extra Tree Classifier

The Extremely Randomized Trees classifier is an ensemble learning technique that aggregates the results of multiple de-correlated decision trees, collected in a "forest", to produce its classification output. The Extra Trees classifier is similar to the Random Forest classifier and differs from it only in how the decision trees in the forest are constructed.
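A sketch of these three feature-ranking techniques with pandas and scikit-learn follows; the choice of k = 8 is illustrative, as the report does not state how many features were kept.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import ExtraTreesClassifier

# Pearson correlation of every feature with the target (pandas default)
correlations = X_train.assign(stroke=y_train).corr()["stroke"].drop("stroke")

# Univariate selection: score features with the ANOVA F-test, keep the k best
selector = SelectKBest(score_func=f_classif, k=8).fit(X_train, y_train)
univariate_scores = pd.Series(selector.scores_, index=X_train.columns)

# Extra Trees impurity-based feature importances
extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=42)
extra_trees.fit(X_train, y_train)
importances = pd.Series(extra_trees.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```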
3.4 Classification

3.4.1 K-Nearest Neighbor

K-Nearest Neighbor is an elementary algorithm that stores all available cases and classifies new data based on a similarity measure. 'K' is the number of nearest neighbors that vote on the class of the new data point. To find the k least distant points, distance measures such as Euclidean distance and Manhattan distance are used. It is also called a lazy learner because it does not derive a discriminative function from the training data; it simply memorizes the training data, and there is no explicit learning phase.

3.4.2 Decision Tree

A decision tree is a tree-shaped diagram used to evaluate a course of action, where each branch represents a possible decision. It is used for both classification and regression: classification is applied to discrete values, while regression is applied to continuous values. A classification tree derives a set of logical if-then conditions to classify problems, whereas a regression tree is employed when the target value is numerical or continuous in nature.

3.4.3 Support Vector Machine (SVM)

Support Vector Machine is a supervised learning method in which the model learns from past data and forms future predictions as output. It works on labeled sample data to find the decision boundary; new unlabelled data are then plotted against this boundary and their values predicted.

3.4.4 Gaussian Naive Bayes

Gaussian Naive Bayes is a variant of Naive Bayes that follows the Gaussian normal distribution and supports continuous data. When working with continuous data, it assumes that the continuous values associated with each class are distributed according to a Gaussian distribution; each continuous-valued feature is modeled as following its own Gaussian.
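The four classifier families compared in this report can be instantiated in scikit-learn as below; the hyperparameters shown (for instance k = 5 for KNN) are illustrative defaults rather than the exact settings behind the reported numbers.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

models = {
    "K-Nearest Neighbor (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "SVM (polynomial kernel)": SVC(kernel="poly"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model on the (resampled) training data and report per-class
# precision, recall and F1 on the held-out test split.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```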
Chapter 4
Methodology

In this paper, different models are introduced to predict whether an individual will have a stroke based on several features such as age, gender, hypertension, heart disease, ever married, smoking status, and work type. The benchmark and non-benchmark datasets are trained on various machine learning algorithms, and their performance is inspected to find out which one predicts stroke most efficiently. Fig. 4.1 shows the flowchart of the proposed model. First, data collection is done, followed by data pre-processing steps to obtain a cleaned dataset without null or duplicate values for better training and higher accuracy. After that, the dataset is split into training and testing data and fed into the different classification models. The confusion matrices and performance metrics of the models are then obtained to identify the most efficient algorithm for the prediction.
Figure 4.1: Flow Chart of Proposed Model
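One plausible way to wire this flow together is an imbalanced-learn Pipeline, which applies the resampling step only when fitting, so the test data stay untouched; the SMOTE+TOMEK / polynomial-SVM pairing below is just one of the combinations evaluated in Chapter 5, not the single proposed model.

```python
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

pipeline = Pipeline([
    ("balance", SMOTETomek(random_state=42)),  # resamples only during fit
    ("clf", SVC(kernel="poly")),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```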
Chapter 5
Experiments and Results

The results obtained after applying Decision Tree, KNN, Gaussian Naive Bayes, and SVM to both datasets are shown in this section. The metrics used for the performance analysis are accuracy, precision (P), recall (R), and F1-score; all of them are derived from the confusion matrix, which summarizes the overall performance of a model. Tables 5.1-5.4 show the recall values obtained from each of the machine learning models under the four hybrid data-balancing techniques, with and without feature engineering, for both the benchmark and the non-benchmark dataset. Specifically, Table 5.1 covers the benchmark dataset without feature engineering, Table 5.2 the benchmark dataset with feature engineering, Table 5.3 the non-benchmark dataset without feature engineering, and Table 5.4 the non-benchmark dataset with feature engineering. It can be observed from the tables that the highest recall value comes from K-Nearest Neighbor and the worst from Decision Tree.
Table 5.1: Recall values for benchmark dataset without feature engineering
(recall per class: 0 = no stroke, 1 = stroke; DT = Decision Tree, GNB = Gaussian Naive Bayes, Lin/Poly/RBF = SVM kernels, KNN = K-Nearest Neighbor)

Balancing technique                       DT        GNB       Lin       Poly      RBF       KNN
                                          0    1    0    1    0    1    0    1    0    1    0    1
Borderline SMOTE+Random Undersampling   0.82 0.56 0.75 0.81 0.80 0.75 0.85 0.65 0.83 0.63 0.82 0.75
SVM SMOTE+Random Undersampling          0.79 0.63 0.75 0.79 0.79 0.77 0.85 0.65 0.82 0.69 0.78 0.79
SMOTE+Tomek                             0.87 0.35 0.61 0.83 0.72 0.71 0.80 0.75 0.85 0.50 0.78 0.67
SMOTE+ENN                               0.86 0.50 0.65 0.87 0.69 0.85 0.72 0.83 0.78 0.65 0.76 0.73

Table 5.2: Recall values for benchmark dataset with feature engineering

Balancing technique                       DT        GNB       Lin       Poly      RBF       KNN
                                          0    1    0    1    0    1    0    1    0    1    0    1
Borderline SMOTE+Random Undersampling   0.56 0.80 0.74 0.81 0.79 0.73 0.85 0.65 0.81 0.77 0.81 0.79
SVM SMOTE+Random Undersampling          0.77 0.65 0.77 0.77 0.80 0.73 0.85 0.65 0.76 0.79 0.79 0.77
SMOTE+Tomek                             0.87 0.37 0.60 0.83 0.71 0.79 0.79 0.69 0.80 0.62 0.81 0.60
SMOTE+ENN                               0.87 0.38 0.64 0.85 0.69 0.83 0.74 0.81 0.79 0.69 0.79 0.73

Table 5.3: Recall values for non-benchmark dataset without feature engineering

Balancing technique                       DT        GNB       Lin       Poly      RBF       KNN
                                          0    1    0    1    0    1    0    1    0    1    0    1
Borderline SMOTE+Random Undersampling   0.87 0.40 0.75 0.76 0.81 0.66 0.84 0.60 0.87 0.44 0.87 0.47
SVM SMOTE+Random Undersampling          0.84 0.48 0.77 0.76 0.80 0.68 0.84 0.60 0.85 0.54 0.85 0.50
SMOTE+Tomek                             0.92 0.21 0.61 0.81 0.78 0.67 0.78 0.71 0.78 0.48 0.85 0.37
SMOTE+ENN                               0.90 0.21 0.63 0.79 0.74 0.73 0.76 0.73 0.78 0.47 0.82 0.45

Table 5.4: Recall values for non-benchmark dataset with feature engineering

Balancing technique                       DT        GNB       Lin       Poly      RBF       KNN
                                          0    1    0    1    0    1    0    1    0    1    0    1
Borderline SMOTE+Random Undersampling   0.86 0.43 0.76 0.74 0.81 0.67 0.84 0.61 0.87 0.51 0.87 0.44
SVM SMOTE+Random Undersampling          0.81 0.47 0.76 0.75 0.80 0.69 0.84 0.63 0.84 0.59 0.84 0.53
SMOTE+Tomek                             0.91 0.20 0.61 0.81 0.77 0.70 0.77 0.70 0.75 0.60 0.86 0.30
SMOTE+ENN                               0.89 0.26 0.62 0.80 0.74 0.76 0.75 0.74 0.75 0.58 0.83 0.38

Figure 5.1: Classification Report and Confusion Matrix of KNN (K=5) using Borderline SMOTE+Random Undersampling (Benchmark Dataset without Feature Engineering)

Figure 5.2: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=8) using SVMSMOTE+Random Undersampling (Benchmark Dataset without Feature Engineering)
Figure 5.3: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Benchmark Dataset without Feature Engineering)

Figure 5.4: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Benchmark Dataset without Feature Engineering)

Figure 5.5: Classification Report and Confusion Matrix of KNN (K=2) using Borderline SMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering)

Figure 5.6: Classification Report and Confusion Matrix of SVM RBF (K=5) using SVMSMOTE+Random Undersampling (Benchmark Dataset with Feature Engineering)

Figure 5.7: Classification Report and Confusion Matrix of SVM Linear using SMOTE+TOMEK (Benchmark Dataset with Feature Engineering)
Figure 5.8: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Benchmark Dataset with Feature Engineering)

Figure 5.9: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1) using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset without Feature Engineering)

Figure 5.10: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1) using SVMSMOTE+Random Undersampling (Non-benchmark Dataset without Feature Engineering)

Figure 5.11: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Non-benchmark Dataset without Feature Engineering)

Figure 5.12: Classification Report and Confusion Matrix of SVM Poly using SMOTE+ENN (Non-benchmark Dataset without Feature Engineering)
Figure 5.13: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=3) using Borderline SMOTE+Random Undersampling (Non-benchmark Dataset with Feature Engineering)

Figure 5.14: Classification Report and Confusion Matrix of Gaussian Naive Bayes (K=1) using SVMSMOTE+Random Undersampling (Non-benchmark Dataset with Feature Engineering)

Figure 5.15: Classification Report and Confusion Matrix of SVM Poly using SMOTE+TOMEK (Non-benchmark Dataset with Feature Engineering)

Figure 5.16: Classification Report and Confusion Matrix of SVM Linear using SMOTE+ENN (Non-benchmark Dataset with Feature Engineering)
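For reference, the per-class recall values reported in Tables 5.1-5.4 follow directly from a confusion matrix; the matrix below uses made-up counts purely for illustration.

```python
import numpy as np

cm = np.array([[820, 180],   # row 0: actual no-stroke -> [TN, FP]
               [ 25,  75]])  # row 1: actual stroke    -> [FN, TP]

# Recall for a class = correctly predicted / all actual members of that class
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(f"recall(no stroke, label 0) = {recall_per_class[0]:.2f}")  # 0.82
print(f"recall(stroke, label 1)    = {recall_per_class[1]:.2f}")  # 0.75
```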
Chapter 6
Future Work and Conclusion

In this paper, the Decision Tree gave the worst results in all cases. Gaussian Naive Bayes gave a good recall value for label 1 (stroke) in all cases, but most of the time its recall for label 0 (no stroke) was lower than for label 1, and its accuracy was also lower. In general, KNN and SVM gave good results compared to all the other models, as shown in Table 6.1; SVM with the polynomial kernel performed particularly well. For the non-benchmark dataset, which is large, the hybrid techniques SMOTE+TOMEK and SMOTE+ENN performed better than Borderline SMOTE and SVMSMOTE. Gaussian Naive Bayes did achieve a recall value for label 1 (stroke) under Borderline SMOTE and SVMSMOTE that is good enough to compete with SMOTE+TOMEK and SMOTE+ENN, but, as mentioned earlier, it mostly did not give a good recall value for label 0 (no stroke) and also had low accuracy. So, overall, SMOTE+TOMEK and SMOTE+ENN performed better for the non-benchmark dataset. This paper surveys different machine learning algorithms; in the near future, it will be extended by applying deep learning algorithms as well.

Table 6.1: Result (best-performing model for each dataset and balancing technique)

Balancing technique                     Benchmark,          Benchmark,          Non-benchmark,      Non-benchmark,
                                        without FE          with FE             without FE          with FE
Borderline SMOTE+Random Undersampling   KNN (K=5)           KNN (K=2)           Gaussian NB (K=1)   Gaussian NB (K=3)
SVMSMOTE+Random Undersampling           Gaussian NB (K=8)   SVM RBF (K=5)       Gaussian NB (K=1)   Gaussian NB (K=1)
SMOTE+TOMEK                             SVM Poly            SVM Linear          SVM Poly            SVM Poly
SMOTE+ENN                               SVM Poly            SVM Poly            SVM Poly            SVM Linear
References

[1] T. Rakshit and A. Shrestha, "Comparative analysis and implementation of heart stroke prediction using various machine learning techniques."
[2] H. Shashank, S. Srikanth, A. Thejas, et al., Prediction of Stroke Using Machine Learning. PhD thesis, CMR Institute of Technology, Bangalore, 2020.
Generated using Undergraduate Thesis LaTeX Template, Version 1.4. Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh. This project report was generated on Wednesday 29th September, 2021 at 4:11pm.