Assignment No. 4 - Creating A Clustering Model For Cancer Data.
Student Name & ID: Ravi Nakulan (XXXXX)
Professor: S. P. (DATA 1200 – 01)
College: Durham College
Tools Used: Python (Jupyter Notebook)
Due Date: March 12th, 2021
Ravi Nakulan 1
ANALYSIS DESCRIPTION
Mr. John Hughes would like to use the SVM and Naïve Bayes algorithms on the cancer.csv dataset in order to
assess which of these two models gives the best results.
Both are supervised machine learning algorithms used for classification problems.
Footnote:
*SVM: Support Vector Machine
A Comparison: SVM vs. Naïve Bayes
• To run both algorithms for comparison, I used Jupyter Notebook as an IDE to write my
Python code.
• The code cells and their descriptions follow.
• The notebook file is named: Clustering Model_Cancer Data_Assignment No. 04
Footnote:
*IDE: Integrated Development Environment
In[1]. Loaded Necessary Libraries:
Pandas: mainly for working with tabular data via its
2-D table object (DataFrame)
NumPy: mainly for working with numerical data via
multi-dimensional arrays
matplotlib.pyplot: a collection of functions for
creating a figure, adding a plotting area to the
figure, drawing lines, etc.
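The library descriptions above correspond to the usual import cell at the top of the notebook; a minimal sketch (the aliases pd, np, and plt are the conventional ones, assumed here since the slide shows descriptions rather than code):

```python
# In[1]: load the necessary libraries
import pandas as pd              # tabular data: 2-D DataFrame objects
import numpy as np               # numerical data: multi-dimensional arrays
import matplotlib.pyplot as plt  # functions for figures, plot areas, lines, etc.
```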
A Comparison: SVM vs. Naïve Bayes…continued
• After loading the necessary libraries, we loaded our cancer.csv dataset from its folder
In[2]. Load Dataset
• Used pd.read_csv('./cancer.csv') to
read the data
• cancer.head() returns the first 5 rows, indexed
from 0 (zero) through 4
• Running the code gives us the top 5 rows
• "Class" is our dependent variable (y)
• The columns from id to Mitoses are the independent
variables (x)
• "Class" has two values, 2 and 4
• 2 means: benign (no cancer)
• 4 means: malignant (cancer)
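The loading step above can be sketched as follows. Since the actual cancer.csv file is not available here, a tiny stand-in CSV is read from a string; only the id, Mitoses, and Class columns are named on the slide, and the values are illustrative, not real data (in the notebook itself the file would be read with pd.read_csv('./cancer.csv')):

```python
import io
import pandas as pd

# In[2]: stand-in for cancer.csv (illustrative rows, not the real dataset)
csv_text = """id,Mitoses,Class
1000025,1,2
1002945,1,2
1015425,1,4
1016277,1,4
1017023,1,2
1017122,2,4
"""
cancer = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows, indexed 0 through 4
print(cancer.head())
```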
A Comparison: SVM vs. Naïve Bayes…continued
• Dropped Class, our dependent variable, from x and
defined y = cancer['Class'].to_numpy()
• Split the data into training (20%) and test (80%) sets and used
random_state=100 so the same split, and therefore the
same results, are produced every time the code is
run
• To standardize (scale) the dataset, sc.fit learns the
preprocessing statistics from the existing data, and
then sc.transform applies that
transformation
In[3]. We created the x & y variables
• Code and descriptions…
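The split-and-scale step can be sketched like this. The feature matrix is synthetic stand-in data (the notebook uses the cancer.csv columns), and the split proportions here are the conventional 80% train / 20% test via test_size=0.2 — an assumption, since the exact call in the notebook is not shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# In[3]: stand-in features and 2/4 class labels (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 9))
y = rng.choice([2, 4], size=100)

# Fixed random_state gives the same split, hence the same results, every run
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=100)

sc = StandardScaler()
x_train = sc.fit_transform(x_train)  # fit learns mean/std, then scales
x_test = sc.transform(x_test)        # reuse the training statistics
```

Note that the scaler is fitted on the training data only and merely applied to the test data, so no information leaks from the test set into preprocessing.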
A Comparison: SVM vs. Naïve Bayes…continued
In[4]. Script for SVM and NB
• Code and descriptions…
We did the necessary imports and created the
models.
The code uses a for loop: it takes the first model
in the list, runs the fit-and-report code, then comes
back to check whether there is another model to run,
so both outputs (SVM and Naïve Bayes) are produced
automatically.
Footnote:
*NB: Naïve Bayes
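The for loop described above can be sketched as below. The data here is a synthetic stand-in from make_classification (the notebook would loop over the scaled cancer.csv features), and SVC()/GaussianNB() with default parameters are assumptions, since the slide shows descriptions rather than the exact calls:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score

# In[4]: stand-in data in place of the scaled cancer.csv features
x, y = make_classification(n_samples=200, n_features=9, random_state=100)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=100)

# One for loop fits each model in turn, so both reports print automatically
results = {}
for name, model in [('SVM', SVC()), ('NB', GaussianNB())]:
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    results[name] = accuracy_score(y_test, y_pred)
    print(name)
    print(classification_report(y_test, y_pred))
```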
A Comparison: SVM vs. Naïve Bayes…continued
• We got the output of our code from In[3] & In[4]
Precision: positive predictive value
Recall: sensitivity (true positive rate)
f1-score: harmonic mean of Precision & Recall
Accuracy: the fraction of predictions the model got right
SVM results:
Precision determines how many of the cases predicted positive are actually positive. The no-cancer precision
is 98%, while the cancer precision is 100%.
Recall measures how many of the actual positives our model captured as true positives: 100% for no cancer
and 96% for cancer. Correctly identifying cancer-positive patients is very important, so a recall of 96% is a
concern.
f1-score balances Precision and Recall: the SVM no-cancer score is 99%, while the cancer score
is 98%.
Accuracy is the ratio of correctly predicted observations to total observations, which is 99%.
Naïve Bayes results:
Precision: the no-cancer precision is 99%, but the cancer precision is 94%.
Recall: 97% for no cancer and 98% for cancer. For the critical task of catching cancer-positive patients, NB's
98% recall beats SVM's 96%.
f1-score: the NB no-cancer score is 98%, while the cancer score is
96%.
Accuracy: the ratio of correctly predicted observations to total observations is 97%.
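The four metrics above can be checked by hand on a small example. The labels below are hypothetical predictions (2 = benign, 4 = malignant), chosen only to make the arithmetic easy to follow; they are not the notebook's output:

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, accuracy_score)

# Hypothetical true labels and predictions (2 = benign, 4 = malignant)
y_true = [2, 2, 2, 2, 4, 4, 4, 4]
y_pred = [2, 2, 2, 4, 4, 4, 4, 4]

# Precision: of everything predicted malignant, how much really is (4/5)
p = precision_score(y_true, y_pred, pos_label=4)   # 0.8
# Recall: of everything really malignant, how much we caught (4/4)
r = recall_score(y_true, y_pred, pos_label=4)      # 1.0
# f1: harmonic mean 2*p*r/(p+r)
f1 = f1_score(y_true, y_pred, pos_label=4)         # ~0.889
# Accuracy: correct predictions over all predictions (7/8)
acc = accuracy_score(y_true, y_pred)               # 0.875
```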
Conclusion: SVM vs. Naïve Bayes
Naïve Bayes: We cannot recommend this model to Mr. John Hughes.
Explanation:
NB is a good starting point for developing a model because it is a simple algorithm with a short overall run time that stays
quick even with 30–50 thousand rows of data. However, it does not suit every problem, especially when there are multiple
independent variables that are interdependent and when prediction accuracy is as critical as it is in the
medical field.
Highlights of the Results: NB vs. SVM
• SVM scored 100% twice: once under precision for cancer and once under recall for no cancer.
• SVM's lowest score is 96%, from recall for cancer, which is still considered very high.
• SVM's f1-score, which balances precision and recall, is better than NB's for both cancer and no cancer.
• The overall weighted average for SVM is 99% versus 97% for NB, which indicates that SVM is the better choice.
SVM: SVM is the right choice for Mr. John Hughes.
Explanation:
SVM is not affected by the number of independent variables and does not depend on their interrelationships, but on
larger datasets its run time can grow from minutes to hours, days, or even weeks, which is SVM's main drawback.
However, this dataset is small enough to run quickly, so SVM is the right choice for Mr. John Hughes: it is a popular
model in medical applications, and we need the highest possible degree of accuracy.
SVM was able to find a better boundary between cancer and no cancer, rather than just estimating probabilities as the NB
model did.
In conclusion, although NB produced a very high result of 97%, SVM did the better job: it optimized the model further
and identified cancer and no cancer with 99% accuracy.