BRAIN STROKE PREDICTION
USING MACHINE LEARNING
TECHNIQUES
BY
NAME : S.RAJAYOGHA
BRANCH : M.SC DATA SCIENCE
REVIEW : FINAL REVIEW
GUIDE NAME : DR.K.SATHESH KUMAR
REVIEW DATE : 08/05/2023
1
CONTENT
• ABSTRACT
• INTRODUCTION
• SYMPTOMS
• EXISTING SYSTEM
• PROPOSED SYSTEM
• DATASET MODULES
• ARCHITECTURE
2
CONTENT(contd.,)
• DIAGRAMS
• ALGORITHMS
• EXPECTED OUTCOMES
• RESULT
• CONCLUSION
• REFERENCE
• PUBLICATION
3
ABSTRACT
• Stroke is a destructive illness that typically affects individuals over
the age of 65 years.
• Prediction of stroke is a time-consuming and tedious task for doctors.
• Five different algorithms are used and their results are compared to find
the best accuracy.
• The aim is to create an application with a user-friendly interface that is
easy to navigate and makes it simple to enter inputs.
4
INTRODUCTION
• A stroke is a life-threatening condition that happens when part of your
brain doesn't have enough blood flow.
• An ischemic stroke is caused by a blockage cutting off the blood
supply to the brain. This is the most common type of stroke.
• A hemorrhagic stroke is caused by bleeding in or around the brain.
• A transient ischemic attack or TIA is also known as a mini-stroke.
5
INTRODUCTION(contd.,)
• Hemorrhagic strokes are particularly dangerous because they cause
severe symptoms that get worse quickly.
• The stages of stroke recovery are Stage 1: Flaccidity (soft, limp muscles),
Stage 2: Spasticity (stiff, clumsy movement), Stage 3: Increased Spasticity,
and Stage 4: Decreased Spasticity.
• Foods high in potassium, such as sweet and white potatoes, bananas,
tomatoes, prunes, melon, and soybeans, can help maintain healthy
blood pressure, the leading risk factor for stroke.
6
SYMPTOMS
7
EXISTING SYSTEM
• In recent times, stress levels in individuals are at an all-time high,
which increases the chances of stroke.
• About 3.0 million deaths resulted from ischemic stroke, while 3.3
million deaths resulted from hemorrhagic stroke. Hence, correctly detecting
the presence of stroke in a patient becomes essential.
• In the existing system, various medical instruments are available on the
market for predicting brain stroke, but they are very expensive and not
efficient enough to calculate the chance of having a brain stroke.
8
DISADVANTAGE OF EXISTING SYSTEM
• It takes a lot of time to detect the disease.
• Results are inaccurate and inefficient.
• This can lead to incomplete data collection.
• Private data collected in hospitals is not kept safe.
• The instruments may not be universally accessible or adopted by all hospital
providers or institutions.
9
PROPOSED SYSTEM
• Artificial Intelligence contributes various algorithms that are
effective in making decisions and predictions from the large quantity
of data produced by the healthcare industry.
• Based on the proposed problem, ML provides different
classification algorithms to predict the probability of a patient having
a brain stroke.
• Unlike the expensive instruments of the existing system, the proposed
system is a low-cost, flexible software solution that is efficient enough
to calculate the chance of a brain stroke.
10
ADVANTAGE OF PROPOSED SYSTEM
• It detects brain stroke in less time.
• More accuracy and efficiency.
• Private data collected in hospitals is kept safe.
• It can be universally accessible and adopted by all hospital providers
and institutions.
11
Software configuration
12
Frontend and backend
Using Python as the frontend and MySQL as the backend in a
healthcare stroke-data project can provide several benefits:
• 1. Python is a popular programming language for data analysis and
visualization, which can be useful in analyzing stroke data.
• 2. MySQL can handle large amounts of data and can be easily scaled to
meet the needs of the project.
• 3. The combination of Python and MySQL provides seamless
integration between the front end and back end, making it easier to
manage and analyze data.
13
Frontend and backend
• 4. Python has a wide range of libraries and frameworks that can be used
to build interactive and user-friendly interfaces for the project.
• 5. MySQL is known for its reliability and stability, which is crucial in a
healthcare project where the accuracy and consistency of data are critical.
• 6. Overall, using Python for the front end and MySQL for the back end
in a healthcare stroke-data project provides a powerful and efficient
solution for managing and analyzing healthcare data; a minimal connection
sketch is shown below.
14
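As a minimal sketch of how the two layers could talk to each other, using the
mysql-connector-python driver; the database name `stroke_db`, the `patients`
table, and the credentials are hypothetical placeholders, not the project's
actual configuration:

```python
# Hedged sketch: hypothetical MySQL schema and credentials for illustration.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="stroke_db"
)
cur = conn.cursor()

# Insert one patient record collected from the web form.
cur.execute(
    "INSERT INTO patients (age, hypertension, avg_glucose_level, bmi) "
    "VALUES (%s, %s, %s, %s)",
    (67, 1, 228.69, 36.6),
)
conn.commit()

# Read records back for analysis on the Python side.
cur.execute("SELECT age, hypertension, avg_glucose_level, bmi FROM patients")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```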
MODULES
Dataset collection: the dataset was obtained from the Kaggle website,
https://www.kaggle.com/healthcare-dataset-stroke-data.
15
MODULES
Balancing Dataset:
• There were 5110 rows and 12 columns in this dataset.
• The value of the output column stroke is either 1 or 0. The number
0 indicates that no stroke risk was identified, while the value 1
indicates that a stroke risk was detected.
• The number of 0 values in the output column (stroke) far exceeds the
number of 1 values, so the dataset is imbalanced and needs to be balanced
before training; a balancing sketch is shown after this slide. The meaning of
the 0 and 1 values in this dataset is given below.
0 = not a stroke
1 = stroke
16
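The slides do not state which balancing technique was applied, so the sketch
below shows one common option, random oversampling of the minority class with
scikit-learn; the CSV file name is assumed to match the Kaggle download:

```python
# Illustrative balancing sketch; the actual method used in the project may differ.
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # assumed local file name
print(df["stroke"].value_counts())  # 0 = no stroke, 1 = stroke (heavily imbalanced)

majority = df[df["stroke"] == 0]
minority = df[df["stroke"] == 1]

# Oversample the minority class until both classes are the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["stroke"].value_counts())
```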
MODULES
Preprocessing
 Before building a model, data preprocessing is required to remove unwanted
noise and outliers from the dataset that could lead the model to depart from its
intended training.
 This stage addresses everything that prevents the model from functioning more
efficiently. Following the collection of the relevant dataset, the data must be
cleaned and prepared for model development. As stated before, the dataset used has
twelve characteristics.
 To improve accuracy, data preprocessing is also used to balance the data. The
output column contains the total number of stroke and non-stroke records before
preprocessing. A short preprocessing sketch is shown after this slide.
17
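A preprocessing sketch under stated assumptions: the deck only says the data is
cleaned and has twelve columns, so the concrete steps below (dropping the id
column, imputing missing BMI values, label-encoding the text columns) are
illustrative choices rather than the project's exact pipeline:

```python
# Illustrative preprocessing steps for the Kaggle stroke dataset.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("healthcare-dataset-stroke-data.csv")

df = df.drop(columns=["id"])                      # identifier carries no signal
df["bmi"] = df["bmi"].fillna(df["bmi"].median())  # fill missing BMI values

# Convert text columns such as gender, work_type and smoking_status to numbers.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df.dtypes)
```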
MODULES
• Correlation Matrix:
In the correlation heatmap (FIG 1), we can see that there is no
multicollinearity present, and ‘Age’ and ‘Glucose Level’ are among the
features most strongly correlated with ‘Stroke’.
• Best Features using Chi-Square Test:
The chi-square test shows that Age, Average Glucose Level and Hypertension
are the top 3 features having the maximum impact on the output ‘Stroke’.
A code sketch of both analyses is shown after the correlation figures.
18
Correlation matrix
FIG 1: CORRELATION MATRIX
19
Correlation matrix
FIG 2: DATA VISUALIZATION OF PARAMETERS IN
CORRELATION
20
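A sketch of the two analyses described above, assuming a fully numeric
dataframe `df` produced by the preprocessing step, with column names as in the
Kaggle dataset:

```python
# Correlation heatmap and chi-square feature ranking (illustrative sketch).
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, chi2

# Correlation heatmap (FIG 1): check for multicollinearity between features.
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()

# Chi-square scores: rank features by their impact on the 'stroke' column.
X = df.drop(columns=["stroke"])
y = df["stroke"]
selector = SelectKBest(score_func=chi2, k=3).fit(X, y)  # chi2 needs non-negative inputs
scores = sorted(zip(X.columns, selector.scores_), key=lambda t: t[1], reverse=True)
for name, score in scores[:3]:
    print(f"{name}: {score:.1f}")
```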
MODULES
Evaluation Metrics
 The confusion matrix is a tool for evaluating the performance of
machine learning classification algorithms. The confusion matrix has been
used to test the efficiency of all the models created. It illustrates how
often our models predict correctly and how often they predict incorrectly.
 Incorrectly predicted values are counted as false positives and false
negatives, whereas correctly predicted values are counted as true positives
and true negatives. After grouping all predicted values in the matrix, each
model's accuracy, precision-recall trade-off, and AUC were used to assess
its performance. An evaluation sketch is shown after this slide.
21
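A hedged evaluation sketch in scikit-learn terms; `model`, `X_test`, and
`y_test` are assumed to come from the training step shown after the
architecture slide:

```python
# Confusion matrix, precision/recall/accuracy, and AUC for one trained model.
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # true/false positive and negative counts
print(classification_report(y_test, y_pred))   # precision, recall, F1, accuracy
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```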
ARCHITECTURAL DESIGN FOR PROPOSED SYSTEM
FIG 3: ARCHITECTURE. The flow is: Start → Collect dataset → Data cleaning →
Perform data balancing → Split data into training data (80%) and testing data
(20%) → Classifier training (Logistic Regression, Random Forest, Decision
Tree, XGBoost) → Output: Stroke: Yes / Stroke: No.
A code sketch of this pipeline is shown after this slide.
22
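A compact sketch of the pipeline in FIG 3, assuming a balanced dataframe
`balanced` from the earlier balancing step; hyperparameters are left at their
defaults except where noted and are not taken from the slides:

```python
# 80/20 split and training of the four classifiers named in the architecture.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

X = balanced.drop(columns=["stroke"])   # `balanced` from the balancing step
y = balanced["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```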
ER DIAGRAM (User)
FIG 4: ER DIAGRAM FOR USER
23
ER DIAGRAM (Admin)
FIG 5: ER DIAGRAM FOR ADMIN
24
DATA FLOW DIAGRAM
FIG 6: DATA FLOW DIAGRAM
25
USE CASE DIAGRAM (User)
FIG 7: USE CASE DIAGRAM FOR USER
26
USE CASE DIAGRAM (Admin)
FIG 8:USE CASE DIAGRAM FOR ADMIN
27
ALGORITHM/TECHNIQUE USED WITH
COMPLEXITY
Extreme Gradient Boosting Classifier:
 XGBoost is a decision-tree-based ensemble machine learning
algorithm that uses a gradient boosting framework.
 In prediction problems involving unstructured data (images, text,
etc.), artificial neural networks tend to outperform all other algorithms or
frameworks. However, when it comes to small-to-medium
structured/tabular data, decision-tree-based algorithms are considered
best-in-class right now. A short usage sketch is shown after this slide.
28
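A short usage sketch; the slides do not list the XGBoost hyperparameters used,
so the values below are typical illustrative choices, and `X_train`/`X_test`
come from the 80/20 split shown earlier:

```python
# Illustrative XGBoost configuration (hyperparameter values are assumptions).
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=200,      # number of boosted trees
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    max_depth=4,           # depth of each decision tree
    eval_metric="logloss",
)
xgb.fit(X_train, y_train)
print("XGBoost accuracy:", xgb.score(X_test, y_test))
```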
ALGORITHM/TECHNIQUE USED WITH
COMPLEXITY
Random Forest:
 Random Forest is a popular ML algorithm that belongs to the
supervised learning technique.
 It can be used for both classification and regression problems in ML.
 It is based on the idea of ensemble learning, which is the process of
combining multiple classifiers to solve a complex problem. A short usage
sketch is shown after this slide.
29
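An illustrative sketch under the same assumed setup; besides prediction, a
fitted Random Forest exposes feature importances, which is useful for checking
which inputs drive the stroke prediction:

```python
# Random Forest ensemble of decision trees; hyperparameters are illustrative.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)  # X_train/y_train from the 80/20 split shown earlier

for name, importance in sorted(
    zip(X_train.columns, rf.feature_importances_), key=lambda t: -t[1]
):
    print(f"{name}: {importance:.3f}")
```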
ALGORITHM/TECHNIQUE USED WITH
COMPLEXITY
Logistic Regression:
 Logistic regression is a statistical model that in its basic form uses a
logistic function to model a binary dependent variable, although many more
complex extensions exist.
 In regression analysis, logistic regression (or logit regression) estimates
the parameters of a logistic model (a form of binary regression).
 The log-odds of the outcome are modeled as a linear combination of one or
more independent variables ("predictors"); each independent variable can be a
binary variable (two classes, coded by an indicator variable) or a continuous
variable (any real value). A worked illustration is shown after this slide.
30
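A worked illustration of the logistic model described above: the predicted
probability is the logistic (sigmoid) function applied to a linear combination
of the predictors. The coefficient values below are made up purely for
demonstration and are not the fitted model's coefficients:

```python
# Logistic function applied to a linear combination of predictors (toy weights).
import numpy as np

def predict_stroke_probability(age, hypertension, avg_glucose_level):
    # hypothetical coefficients: intercept plus one weight per predictor
    z = -7.0 + 0.07 * age + 0.5 * hypertension + 0.004 * avg_glucose_level
    return 1.0 / (1.0 + np.exp(-z))  # logistic (sigmoid) function

print(predict_stroke_probability(age=67, hypertension=1, avg_glucose_level=228.69))
```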
EXPECTED OUTCOMES
Home Login
The above screenshot shows the home login page.
31
EXPECTED OUTCOMES
Patient Login
The above screenshot shows the patient login page.
32
EXPECTED OUTCOMES
Stroke prediction by giving patient’s data
The above screenshot shows the prediction form.
33
EXPECTED OUTCOMES
Display whether stroke or not
The above screenshot shows the prediction result.
34
EXPECTED OUTCOMES
Doctor login
The above screenshot shows the doctor login page.
35
EXPECTED OUTCOMES
Logistic Regression
The above screenshot shows the logistic regression results.
36
EXPECTED OUTCOMES
Decision Tree
The above screenshot shows the decision tree results.
37
EXPECTED OUTCOMES
Random Forest
The above screenshot shows the random forest results.
38
EXPECTED OUTCOMES
XgBoost
The above screenshot shows the XGBoost results.
39
EXPECTED OUTCOMES
Algorithm Comparison
The above screenshot shows the algorithm comparison.
40
RESULT
Algorithm results:
Algorithm           | F1 score | Precision | Recall | Accuracy
Logistic Regression | 0.81     | 0.80      | 0.81   | 0.81
Decision Tree       | 0.92     | 0.91      | 0.94   | 0.92
Random Forest       | 0.95     | 0.93      | 0.96   | 0.95
XGBoost             | 0.96     | 0.96      | 0.96   | 0.96
41
Therefore, this project helps to predict patients who are at risk of brain
stroke by cleaning the dataset and applying the models to get an average
accuracy of 96.68%; the highest accuracy is achieved by XGBoost.
CONCLUSION
 The importance of knowing and understanding the risks of brain stroke is
very high in these trying times.
 The model predicts the probability of brain stroke on the basis of simple,
commonly known day-to-day parameters.
 This makes the project highly relevant and of great need to society. The
objective of implementing the project on a web platform was to reach as many
individuals as possible.
 An early warning can save the life of someone who might have a probability
of a stroke.
42
REFERENCES
• [1] Tasfia Ismail Shoily, Tajul Islam, Sharmin Akter Tanna,
"Detection of stroke using machine learning algorithms", 10th International
Conference on Computing, Communication and Networking Technologies
(ICCCNT), IEEE, July 2019.
• [2] JoonNyung Heo, Jihoon G. Yoon, Hyungjong Park, Young
Dae Kim, Hyo Suk Nam and Ji Hoe Heo, "Stroke prediction in acute
stroke", Stroke, 2019;50:1263-1265, AHA Journal, 20 Mar 2019.
• [3] Jaehak Yu, Damee Kim, Hongkyu Park, Sun-Jin Kim, Sungkyu
Yu, Sejin Park and Seunghee, "Semantic analysis of NIH stroke", 2019
International Conference on Platform Technology and Service (PlatCon),
IEEE, 30 Jan 2019.
43
PUBLICATION
• Rajayogha, S., & Bruxella, D. J. M. D. (2023, March 31). Early prediction
of brain stroke using logistic regression. International Journal for
Research in Applied Science and Engineering Technology, 11(3), 1355-1361.
https://doi.org/10.22214/ijraset.2023.49651
44
THANK YOU
45
