2. Presented By:
KOYEL MAJUMDAR
Registration No: 1011621400088 Roll No: 22342106
RINA PAUL
Registration No: 2014009633 Roll No: 22342110
Guided By:
Dr. KUNAL DAS
Assistant Professor, Acharya Prafulla Chandra College
3. OUTLINE
• Motivation
• Objectives
• Related Work
• Research Gap
• Project Novelty
• Design Model
• Introduction to Machine Learning
• Project work
• Algorithms
• Design
• Result and Discussion
• Future work
• References
4. Motivation
Limited availability of health centres
Limited availability of doctors
Distance to the nearest health centre
Recognizing a disease from its symptoms
Limited availability of medicines
Need for point-of-care facilities
5. Objectives
Our main aim is to provide quick medical diagnosis to patients living in rural areas.
A contactless system is especially useful for post-COVID rural health services.
The goal is to provide access to medical specialists.
This system enhances the quality of health care.
6. Related Work
Shahadat Uddin, Arif Khan, Md Ekramul Hossain and Mohammad Ali Moni compared different supervised machine learning algorithms for disease prediction, evaluating LR, SVM, Decision Tree, Random Forest, Naïve Bayes, K-Nearest Neighbour and ANN.
Reference:
https://link.springer.com/article/10.1186/s12911-019-1004-8
7. Research Gap
The reality of modern medicine is that there are far more patients than doctors able to help them.
The recent pandemic has shown that the world is not ready for emergencies of that sort, and there is a global need for qualified doctors.
Complex new diseases appear continuously, leaving a growing number of people deprived of medical facilities.
ML techniques can help doctors make a correct identification of a disease. Moreover, the cost of emergency health diagnosis becomes affordable to a wide range of people.
Doctors will be able to communicate through this system even when they are not present in the health centre.
8. Project Novelty
Depending on the disease being diagnosed, our system assigns a specialist doctor for that disease.
Of the five algorithms used in this project, in most cases three give accurate results.
Our system is easy for users to understand: the GUI is designed so simply that anyone can operate it.
10. Machine Learning and its
Algorithms
Machine Learning uses programmed algorithms that learn and optimize their
operation by analysing input data to make predictions within an acceptable
range. Machine learning algorithms can be divided into three broad
categories according to their purpose: supervised, unsupervised and
semi-supervised.
In supervised machine learning, a labelled training dataset is first used to
train the underlying algorithm. The trained model is then applied to the
unlabelled test dataset to categorise its samples.
For disease prediction the learning models include Support Vector Machine,
Decision Tree, Random Forest, Naïve Bayes and K-Nearest Neighbour.
11. Here we use five different machine learning algorithms.
The first is a Decision Tree, the second a Random Forest, the third Naïve Bayes,
the fourth K-Nearest Neighbour and the last Support Vector Machine.
We import Pandas for manipulating the CSV files, NumPy, scikit-learn for the
algorithms and Tkinter for the GUI.
The first step is to make a dataset of the symptoms and diseases.
The datasets are given below:
Prototype.csv
Prototype1.csv
The doctor dataset for this project is:
C:UserskushalDOCTORLISTEXCELCSV2.csv
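Before the classifiers can run, the selected symptoms have to be turned into the binary feature vector the dataset uses. A minimal sketch of that encoding step (the symptom names here are hypothetical placeholders; the real list comes from Prototype.csv):

```python
# Hypothetical symptom vocabulary standing in for the columns of Prototype.csv.
symptoms = ["itching", "skin_rash", "headache", "high_fever", "fatigue"]

def encode(selected):
    """Return a 0/1 feature vector marking which symptoms are present."""
    return [1 if s in selected else 0 for s in symptoms]

vector = encode(["headache", "high_fever"])
print(vector)  # [0, 0, 1, 1, 0]
```

Each classifier then receives vectors of this shape as its input rows.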
16. Traverse the file as shown in the code and store the features and labels in
x_test and y_test. Then ravel y_test using the NumPy module.
Now comes the main part of machine learning: training and testing the model.
The training file in our program is named prototype.csv and the testing file
is named prototype1.csv.
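The train/test flow above can be sketched as follows, with a toy in-memory dataset standing in for prototype.csv and prototype1.csv (the symptom columns and disease names are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row is a binary symptom vector; the label column is the disease name.
X_train = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])
y_train = np.array([["flu"], ["cold"], ["flu"], ["cold"]])

X_test = np.array([[1, 1, 0]])
# ravel flattens the (n, 1) label column read from the CSV to shape (n,),
# which is what scikit-learn estimators expect.
y_test = np.ravel(np.array([["flu"]]))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, np.ravel(y_train))
print(clf.predict(X_test))
```

The same fit/predict pattern applies to the other four classifiers.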
18. Used Algorithms
Decision Tree Algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
Step-3: Divide S into subsets that contain the possible values of the best
attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the nodes
cannot be classified further; the final nodes are called leaf nodes.
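Step-2 can be illustrated with a tiny hand-rolled Attribute Selection Measure: information gain, one common ASM. The data and attribute names below are made up for the example.

```python
import math
from collections import Counter

# Toy dataset: (attributes) -> class label.
data = [
    ({"outlook": "sunny",    "windy": 0}, "no"),
    ({"outlook": "sunny",    "windy": 1}, "no"),
    ({"outlook": "rain",     "windy": 0}, "yes"),
    ({"outlook": "rain",     "windy": 1}, "no"),
    ({"outlook": "overcast", "windy": 0}, "yes"),
    ({"outlook": "overcast", "windy": 1}, "yes"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr):
    """Entropy of the whole set minus the weighted entropy after splitting on attr."""
    labels = [y for _, y in data]
    base = entropy(labels)
    remainder = 0.0
    for v in {x[attr] for x, _ in data}:
        subset = [y for x, y in data if x[attr] == v]
        remainder += len(subset) / len(data) * entropy(subset)
    return base - remainder

# The attribute with the highest gain becomes the next split node (Step-4).
best = max(["outlook", "windy"], key=info_gain)
print(best)
```

Recursing on each resulting subset, as in Step-5, grows the full tree.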
19. Used Algorithms
Random Forest Algorithm:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points
(subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 & 2.
Step-5: For new data points, find the prediction of each decision
tree, and assign the new data points to the category that wins the
majority of votes.
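In scikit-learn these steps collapse into one estimator: n_estimators is the number N of trees, each trained on a random bootstrap subset, and prediction is by majority vote. A sketch on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: the label is determined by the first feature.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y = np.array([0, 0, 1, 1] * 5)

forest = RandomForestClassifier(n_estimators=10, random_state=0)  # Step-3: choose N
forest.fit(X, y)            # Steps 1, 2 and 4 happen internally (bootstrap + trees)
print(forest.predict([[1, 0]]))  # Step-5: majority vote across the 10 trees
```

Because every tree votes, a few overfit trees are outvoted, which is the point of the ensemble.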
20. Used Algorithms
Naïve Bayes Algorithm:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given
features.
3. Use Bayes' theorem to calculate the posterior probability.
Bayes' Theorem:
• Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a
hypothesis given prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability of the hypothesis and P(B) is the probability of the evidence.
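A tiny numeric check of the theorem, with made-up numbers for illustration: a disease with prior P(D) = 0.01 and a test with sensitivity P(pos|D) = 0.9 and false-positive rate P(pos|¬D) = 0.05.

```python
p_d = 0.01               # prior P(D)
p_pos_given_d = 0.9      # likelihood P(pos|D)
p_pos_given_not_d = 0.05 # false-positive rate P(pos|not D)

# Evidence P(pos) by the law of total probability.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Posterior P(D|pos) = P(pos|D) * P(D) / P(pos)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # 0.154
```

Even a fairly accurate test yields a modest posterior when the prior is small, which is why the prior term matters.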
21. Used Algorithms
K-Nearest Neighbour Algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance to each data point.
Step-3: Take the K nearest neighbours as per the calculated Euclidean
distances.
Step-4: Among these K neighbours, count the number of data points
in each category.
Step-5: Assign the new data point to the category with the maximum
number of neighbours.
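The five steps above are short enough to write out by hand. A sketch with made-up 2-D points and labels:

```python
import math
from collections import Counter

# Toy training set: (point, label).
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((5, 5), "B"), ((5, 6), "B")]

def knn_predict(point, k):
    # Steps 2-3: sort training points by Euclidean distance, keep the k nearest.
    nearest = sorted(train, key=lambda t: math.dist(point, t[0]))[:k]
    # Step 4: count votes per category.
    votes = Counter(label for _, label in nearest)
    # Step 5: return the majority category.
    return votes.most_common(1)[0][0]

print(knn_predict((1, 1), 3))
```

Changing K can change the answer for border points, which is why K is a tuning parameter.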
22. Used Algorithms
Support Vector Machine Algorithm:
Step 1: Find the best line or decision boundary; this best boundary is called a
hyperplane.
Step 2: Find the closest points of the lines from both classes. These points are called
support vectors.
Step 3: The distance between the support vectors and the hyperplane is called the margin,
and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin
is the optimal hyperplane.
Step 4: For non-linear data, we cannot draw a single straight line.
Step 5: To separate such data points, we need to add one more dimension. For linear
data we used two dimensions x and y, so for non-linear data we add a third
dimension z.
Step 6: In 3-D space the separator looks like a plane parallel to the x-axis. Converted
back to 2-D, it becomes a circle of radius 1 around the non-linear data.
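Steps 4-6 can be sketched with scikit-learn: points inside versus outside a circle are not linearly separable in 2-D, but an RBF kernel lets the SVM separate them by implicitly adding dimensions (the kernel trick). The data here is synthetic, for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic non-linear data: class 1 inside the unit circle, class 0 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

clf = SVC(kernel="rbf")  # non-linear decision boundary via the kernel trick
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [1.8, 1.8]]))  # a point inside, a point far outside
```

With kernel="linear" the same model would fail on this data, which is exactly the limitation Step 4 describes.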
23. Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 Score = 2·TP / (2·TP + FN + FP)
Precision = TP / (TP + FP)
True Positive Rate = TP / (TP + FN)
False Positive Rate = FP / (FP + TN)
Here TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative.
Confusion matrix (rows = actual class, columns = predicted class):
             Predicted P   Predicted N
Actual P         TP            FN
Actual N         FP            TN
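The formulas above, computed for a hypothetical confusion matrix (the TP/TN/FP/FN counts are made up for illustration):

```python
# Hypothetical confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
tpr = TP / (TP + FN)             # true positive rate (recall)
fpr = FP / (FP + TN)             # false positive rate
f1 = 2 * TP / (2 * TP + FN + FP)

print(accuracy, precision, round(f1, 3))  # 0.85 0.888... 0.842
```

Note that accuracy alone can be misleading when the classes are imbalanced, which is why precision, recall and F1 are reported alongside it.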
25. Design
Now we design the GUI for our project.
For the GUI we use Tkinter. First, to set the background image, we create a Canvas with a given
width and height, then display the image using create_image. For text we use Label, and for
selecting symptoms we use OptionMenu. We create Buttons for "Prediction" and "Doctor
Appointment". When the "Prediction" button is pressed, all five algorithms produce a result,
and the disease predicted the most times is taken as the final result. When the "Doctor
Appointment" button is pressed, the names and time schedules of the specialist doctors for the
predicted disease are shown in a Listbox. For showing the predicted disease we use a Text
widget. The GUI is laid out with the grid system.
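The result-combining step behind the "Prediction" button can be sketched without any GUI code: each of the five algorithms returns a predicted disease, and the disease predicted the most times is shown as the final result. The disease names below are hypothetical.

```python
from collections import Counter

# Hypothetical outputs of the five classifiers for one set of symptoms.
predictions = ["Malaria", "Malaria", "Typhoid", "Malaria", "Dengue"]

final_disease, votes = Counter(predictions).most_common(1)[0]
print(final_disease, votes)  # Malaria 3
```

In the GUI, final_disease is what gets written into the Text widget and used to look up specialists in the doctor dataset.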
33. Conclusion
• A disease prediction and doctor appointment system using machine learning is useful in
everyone's day-to-day life, and especially important for the health care sector.
• It helps users who do not want to go to a hospital or clinic: by entering their symptoms and
other relevant information, they can learn which disease they are likely suffering from and
receive a doctor suggestion. The health industry can also benefit: by entering a user's
symptoms into the system, it can within seconds report a reasonably accurate diagnosis and
a doctor-prescribed medicine via the internet.
• Disease prediction provides predictions for common diseases that, when unchecked or
ignored, can turn fatal and cause serious problems for patients and their families.
34. Future work
The project will later add medicine suggestions, and the whole system will be
made available online via the Internet.
Later, a web application will be created from the whole project for the benefit
of the public.
37. Appendix
Algorithms
1. Support Vector Machine: The support vector machine (SVM) algorithm can classify both
linear and non-linear data. It first maps each data item into an n-dimensional feature space,
where n is the number of features. It then identifies the hyperplane that separates the data
items into two classes while maximising the margin for both classes and minimising the
classification error.
2. Naïve Bayes: Naïve Bayes (NB) is a classification technique based on Bayes'
theorem. The theorem describes the probability of an event based on prior knowledge
of conditions related to that event. The classifier assumes that a particular feature in a class is
not directly related to any other feature, although features for that class could have
interdependence among themselves.
38. 3. Random Forest: A random forest (RF) is an ensemble classifier consisting of many DTs, similar to
the way a forest is a collection of many trees. DTs that are grown very deep often overfit the
training data, producing high variation in the classification outcome for a small change in the input data.
They are very sensitive to their training data, which makes them error-prone on the test dataset. The
different DTs of an RF are trained on different parts of the training dataset. To classify a new
sample, its input vector is passed down each DT of the forest. Each DT then considers a
different part of that input vector and gives a classification outcome.
4. Decision Tree: The decision tree (DT) is one of the earliest and most prominent machine learning
algorithms. A decision tree models the decision logic, i.e., the tests and corresponding outcomes for
classifying data items, as a tree-like structure. The nodes of a DT normally have multiple levels, where
the first or top-most node is called the root node. All internal nodes (i.e., nodes having at least one
child) represent tests on input variables or attributes. Depending on the test outcome, the classification
algorithm branches to the appropriate child node, and the process of testing and branching repeats
until it reaches a leaf node. The leaf or terminal nodes correspond to the decision outcomes. DTs are
easy to interpret and quick to learn, and are a common component of many medical diagnostic
protocols.
39. 5. K-Nearest Neighbour: The K-nearest neighbour (KNN) algorithm is one of the simplest and
earliest classification algorithms. It can be thought of as a simpler version of an NB classifier. Unlike
the NB technique, the KNN algorithm does not require probability values. The 'K' in the KNN
algorithm is the number of nearest neighbours considered for the 'vote'. Different values of 'K'
can produce different classification results for the same sample.
40. Computational Complexity of Algorithms
Query-time complexity of KNN = O(k·n·d), where k = number of neighbours, n = number of training
examples and d = number of dimensions of the data (there is no real training phase);
space complexity = O(n·d).
Training time complexity of SVM = O(n²);
run time complexity = O(k·d), where k = number of support vectors.
Training time complexity of DT = O(n·log(n)·d), where n = number of points in the training set;
run time complexity = O(maximum depth of the tree).
Training time complexity of RF = O(n·log(n)·d·k), where k = number of trees;
run time complexity = O(depth of tree · k);
space complexity = O(depth of tree · k).
Training time complexity of NB = O(n·d);
run time complexity = O(c·d), where c = number of classes.