Wisconsin Breast Cancer dataset
GUGC
Da Hee Kim
Advised by Homin Park
Contents
• Introduction
• Loading & checking the data
• Exploratory Data Analysis (EDA)
• Feature Engineering
• Modeling
• Interpretability/ Explainable AI (XAI)
• Wrap-up
• Future research
1. Introduction
- The problem
Christina Applegate, Sharon Osbourne, Angelina Jolie
1. Introduction
- The problem
The 10 leading types of estimated new cancer cases and deaths in 2020. (South Korea)
(A) Estimated new cases
1. Introduction
- The problem
The 10 leading types of estimated new cancer cases and deaths in 2020. (South Korea)
(A) Estimated new cases (B) Estimated Deaths
Cancer ≠ Tumor: a tumor is an abnormal growth of cells forming a mass of tissue.
• Malignant tumors are
cancerous and invade other
sites.
• Benign tumors stay in their
primary location.
1. Introduction
- The problem
1. Introduction
- The problem: Diagnosis
1. Introduction
- The problem
Benign Malignant
• Uniform nucleus size
• Symmetrical
• Homogeneous
• Areas within normal size
• Non-uniform nucleus
• Asymmetrical
• Non-homogeneous sizes
• Areas above normal size
1. Introduction
- The problem: Diagnosis
Problem:
 Depending on type, painful to patient
 Potential side effects (ex: bruising)
 Diagnosis can take time
 Tedious process
Model Malignant
Benign
Machine learning
Imaging
1. Wisconsin Breast Cancer dataset
- What?
• Contains parameters measured from a fine-needle aspirate of a breast mass.
• The parameters describe the cell nuclei. (569 samples)
Cell Nuclei
1. Wisconsin Breast Cancer dataset
- What?
• The parameters include the following 10 features
•Radius
•Texture
•Perimeter
•Area
•Smoothness
•Compactness (perimeter^2 / area - 1.0)
•Concavity (severity of concave portions of the contour)
•Concave points (number of concave portions of the contour)
•Symmetry
•Fractal dimension
1. Wisconsin Breast Cancer dataset
- What?
• The parameters include the following 10 features
•Texture: standard deviation of gray-scale values
•Smoothness: local variation in radius lengths
•Compactness: (perimeter^2 / area - 1.0)
•Concavity: severity of concave portions of the contour
•Concave points: Number of concave portions of the contour
•Symmetry: Uses nucleus deformation parameter to measure
how non-spherical a nucleus is.
• Fractal dimension: ("coastline approximation" - 1)
(Diagram: radius, perimeter, and area measured on the nuclei)
1. Wisconsin Breast Cancer dataset
- What? : Structure of dataset
• 33 columns in total:
• ID
• Diagnosis
• 30 feature columns: the 10 base features, each as mean, standard error, and worst
• Unnamed (empty)
• Doctors would be able to determine whether a tumor is malignant or benign through imaging,
without a biopsy.
• Breakthroughs with breast cancer can act as a stepping stone for other cancers where biopsy is
difficult to conduct.
1. Wisconsin Breast Cancer dataset
- Importance?
Model Malignant
Benign
2. Machine learning process
1. Loading and checking data
1: Loading data
2: Checking for null values
3: Outlier detection
4: Summary and statistics
1. Loading and checking the data
1. Loading and checking the data
(Dataframe preview: column groups ID & diagnosis, mean, standard error, worst, Unnamed)
1. Loading and checking the data
- 2: Checking for null values
• All values in the Unnamed column are null
• We will thus remove this column later
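In pandas this check and removal look like the following (a sketch on a toy frame; "Unnamed: 32" is the usual name of the all-null trailing column in the Kaggle CSV):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the CSV's layout: id, diagnosis, features, and a
# trailing all-null column (named "Unnamed: 32" in the Kaggle CSV).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 12.45, 13.08],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

null_counts = df.isnull().sum()            # only "Unnamed: 32" is entirely null
df_clean = df.dropna(axis=1, how="all")    # drop columns that are all null
```

`dropna(axis=1, how="all")` removes only columns with no data at all, so the feature columns are untouched.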
1. Loading and checking the data
- 3: Outlier Detection
• Redefine “X” to include only the features
• Output: outliers found based only on the feature columns (x_col)
• We will drop these later
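The slides do not state which outlier rule was used; one common choice is Tukey's IQR fence, sketched here on a toy column (the function name and the threshold k = 1.5 are assumptions):

```python
import pandas as pd

def iqr_outliers(X: pd.DataFrame, k: float = 1.5) -> pd.Index:
    """Return indices of rows that fall outside the IQR fence in any feature column."""
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    mask = (X < q1 - k * iqr) | (X > q3 + k * iqr)
    return X.index[mask.any(axis=1)]

# Toy column: the last value is far outside the fence.
X = pd.DataFrame({"radius_mean": [12.0, 12.5, 13.0, 12.2, 40.0]})
outliers = iqr_outliers(X)
```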
1. Loading and checking the data
- 4: Summary and statistics
• We can observe the statistical values for each of the features
• Redefine “X” to include only the features
• Output
1. Loading and checking the data
- 4: Summary and statistics
• We can observe the number of benign and malignant tumors
• Redefine “data_w_diag” to include the diagnosis and the 30 features
• Output
Number of Benign: 357
Number of Malignant: 212
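Both outputs come from standard pandas calls; a sketch using the copy of WDBC bundled with scikit-learn (column and label names differ slightly from the Kaggle CSV):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# The same WDBC data ships with scikit-learn (569 samples, 30 features);
# in scikit-learn's encoding, target 0 = malignant and 1 = benign.
ds = load_breast_cancer()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = pd.Series(ds.target).map({0: "M", 1: "B"})

stats = X.describe()        # count / mean / std / min / quartiles / max per feature
counts = y.value_counts()   # B: 357, M: 212
```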
2. Exploratory Data Analysis
1: Heat map of all features
2: Important features
2-1: Radius vs Perimeter vs Area
1: Heat map
2-2: Compactness vs Concavity vs Concave points
1: Heat map
2: Feature plotting: Histogram
3: Overall data distribution
2. EDA
- 1: Heat Map
We can see a couple of relations using the heat map:
1. Within the mean or worst features, the radius is highly
correlated with the perimeter and the area.
2. Compactness, concavity, and concave points are
correlated with each other.
2. EDA
- 2-1: Radius VS Perimeter VS Area
1. Heat map
• Cells showing 1.0 in different colors: the displayed values are
rounded, so the underlying correlations differ slightly.
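The heat map is a rendered correlation matrix; a sketch using the copy of the data that ships with scikit-learn (the plotting call is shown only as a comment):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()
X = pd.DataFrame(ds.data, columns=ds.feature_names)

corr = X.corr()  # 30 x 30 Pearson correlation matrix
# rendered on the slide as a heat map, e.g. sns.heatmap(corr, cmap="coolwarm")

# radius, perimeter, and area are nearly redundant:
r_vs_p = corr.loc["mean radius", "mean perimeter"]
r_vs_a = corr.loc["mean radius", "mean area"]
```

This is why cells in the radius/perimeter/area block round to 1.0 while their exact values differ.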
2. EDA
-2-2: Compactness VS Concave points VS Concavity
1. Heat map
We will compare the following
features
• Compactness & concavity & concave points
 High correlation
• Compactness mean VS compactness worst
• Concavity mean VS concavity worst
• Concave points mean VS concave points
worst
2. EDA
-2-2: Compactness VS Concave points VS Concavity
2. Feature plotting: Worst VS Mean
• Concave points
• Concavity
• Compactness
• Concavity & compactness: worst ≒ mean
(similar overall distribution)
• Concave points: worst ≠ mean
2. EDA
- 3: Data distribution
• Violin plots: mean, standard error, worst
2. EDA
- 3: Data distribution
• Violin Plot
• Worst
• Red box: Examples of features with good
separation
• Blue box: Examples of features with bad
separation
We assume that features with good separation
will have higher feature importance
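The separation the violin plots show can be quantified as the gap between class means in units of the feature's overall standard deviation (the two example features are illustrative picks for the red and blue boxes):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target  # 0 = malignant, 1 = benign

def class_gap(col: str) -> float:
    """Gap between the two class means, in units of the feature's overall std."""
    v = X[col]
    return abs(v[y == 0].mean() - v[y == 1].mean()) / v.std()

good = class_gap("worst concave points")   # well separated in the violin plots
bad = class_gap("mean fractal dimension")  # poorly separated
```

A larger gap means the two violins barely overlap, matching the "good separation" examples on the slide.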
3. Feature Engineering
1: Standardization
2: Outlier detection
3. Feature engineering
- 1. Standardization
Before
After
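Standardization here is the usual z-score transform; a minimal sketch with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# z = (x - mean) / std, fitted per feature column
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
```

After the transform every feature column has mean 0 and unit variance, which is what the "After" plot reflects.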
3. Feature engineering
- 2. Outlier Deletion: Swarm Plot
Before
After
4. Modeling
1: Splitting data
2: Classification
2-1: ANN
2-2: SVC vs Decision Tree vs Ada Boost vs Random forest vs Extra trees vs GBC vs Logistic regression
3: Cross validate models
4: Hyper parameter tuning
5: Evaluating models
4. Modeling
-1: Splitting data
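The split is not detailed on the slide; a minimal sketch with scikit-learn (the 75/25 ratio and the random seed are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Stratify so the 357/212 benign/malignant ratio is preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```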
4. Modeling
-2: Classification
• ANN
Parameters
• Hidden layers: 3
• Optimizer: Adam
• Learning rate: 0.003
• Epoch: 200
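The slide does not show the ANN framework or the layer widths; an equivalent sketch with scikit-learn's MLPClassifier using the slide's hyperparameters (the 32/16/8 widths are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 3 hidden layers, Adam optimizer, learning rate 0.003, 200 epochs (per the slide);
# the 32/16/8 layer widths are an assumption -- the slide does not list them.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16, 8), solver="adam",
                  learning_rate_init=0.003, max_iter=200, random_state=0),
)
ann.fit(X_train, y_train)
acc = ann.score(X_test, y_test)
```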
4. Modeling
-3: Cross validate models
• SVC VS Decision Tree VS Ada Boost VS Random forest VS Extra trees VS GBC VS Logistic regression
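The model comparison can be sketched with cross_val_score (shown here for four of the seven models; default hyperparameters are an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Extra trees": ExtraTreesClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
    "Logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

# 5-fold cross-validated accuracy for each candidate model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```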
4. Modeling
-4: Hyperparameter tuning for the selected models
• SVC
• Extra trees
• GBC
• Logistic regression
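The slides don't list the search space; a sketch of tuning the rbf SVC with GridSearchCV (the C/gamma grid is an assumption; recall is used for scoring since the evaluation slide prioritizes sensitivity):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
y_mal = (y == 0).astype(int)  # relabel so 1 = malignant; "recall" then counts caught cancers

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.001]}

# Optimize recall so as few malignant cases as possible are missed.
search = GridSearchCV(pipe, grid, cv=5, scoring="recall")
search.fit(X, y_mal)
best_params = search.best_params_
```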
Model | Accuracy | Precision | Recall | F1 score
ANN | 0.978 | 0.950 | 0.974 | 0.962
SVC classifier | 0.992 | 0.974 | 1.000 | 0.987
Extra Trees classifier | 0.978 | 0.974 | 0.950 | 0.962
Gradient boosting classifier | 0.957 | 0.923 | 0.923 | 0.923
Logistic regression | 0.985 | 0.974 | 0.974 | 0.974
4. Modeling
-5: Evaluating models
• Sensitivity (recall) is the most important metric.
 Fewer misclassifications of diagnosing a cancer patient as negative.
• SVC has the highest value for all metrics.
 SVC is effective for data with clear class separation.
 The dataset may be too small for the ANN to achieve the best results.
• All models score above 0.9.
 We assume there exist features that allow good classification.
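The table's four metrics can be reproduced for one model like this (a sketch, not the exact split or hyperparameters behind the slide's numbers):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
y = (y == 0).astype(int)  # relabel so 1 = malignant; recall then counts caught cancers

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)
pred = clf.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),   # sensitivity: the slides' priority
    "f1": f1_score(y_test, pred),
}
```

Relabeling malignant as the positive class matters: with scikit-learn's default encoding (1 = benign), recall would measure the wrong kind of error.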
5. Explainable AI
1: SHAP
1-1: Summary plot
2: Permutation Feature Importance
3: Comparison (SHAP VS PFI)
5. Explainable AI
- 1: SHAP
1. Summary plot
5. Explainable AI
- 1: SHAP
1. Summary plot (continued)
• Concave points_worst has the highest importance
• Features that had high correlation in the EDA:
• Concave points vs concavity vs compactness
 Concave points ≠ concavity ≠ compactness
• Area vs perimeter vs radius
 Area_worst ≒ radius_worst ≠ perimeter_worst
 Area_mean ≠ radius_mean ≒ perimeter_mean
5. Explainable AI
- 2: Permutation feature importance
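Permutation feature importance is available directly in scikit-learn; a sketch on an rbf SVC (the model and split are assumptions, not necessarily those used in the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

ds = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, stratify=ds.target, random_state=42
)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

# Shuffle each feature column in turn and record how much test accuracy drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(ds.feature_names, result.importances_mean),
                key=lambda t: -t[1])
```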
5. Explainable AI
- 3: Comparing Output
SHAP VS PFI
• Interactions: PFI calculates the decrease in model performance when a feature
is permuted, while SHAP values account for interactions between features
and capture non-linear relationships.
 For complex non-linear models, SHAP may provide a better ranking.
o We used ‘rbf’ as the kernel function. SHAP can handle the
non-linearities, while PFI may not fully capture them.
 Some features showed high interactions; PFI does not consider this.
o This may cause differences in importance.
• Distribution differences: SHAP considers the distribution of the entire dataset,
while PFI focuses on the effect of permuting a feature individually.
Wrap-up
• We can determine whether a tumor is benign or malignant using the SVC classifier with
high accuracy.
Implication: no biopsy would be needed for diagnosis.
Future work
• Methods for imaging at scale would be needed.
• There are other cancers, such as kidney cancer, in which biopsy is much more difficult.
The breast cancer model could act as a stepping stone for advancements in those areas.
o No biopsy would be needed, and diagnosis would be less tedious and painful
for patients.
o Human labor can be reduced.

More Related Content

What's hot

Prediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxPrediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxkumari36
 
Fuzzy c-means clustering for image segmentation
Fuzzy c-means  clustering for image segmentationFuzzy c-means  clustering for image segmentation
Fuzzy c-means clustering for image segmentationDharmesh Patel
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET Journal
 
Breast Cancer Detection with Convolutional Neural Networks (CNN)
Breast Cancer Detection with Convolutional Neural Networks (CNN)Breast Cancer Detection with Convolutional Neural Networks (CNN)
Breast Cancer Detection with Convolutional Neural Networks (CNN)Mehmet Çağrı Aksoy
 
Neural Network Based Brain Tumor Detection using MR Images
Neural Network Based Brain Tumor Detection using MR ImagesNeural Network Based Brain Tumor Detection using MR Images
Neural Network Based Brain Tumor Detection using MR ImagesAisha Kalsoom
 
Brain tumor detection using convolutional neural network
Brain tumor detection using convolutional neural network Brain tumor detection using convolutional neural network
Brain tumor detection using convolutional neural network MD Abdullah Al Nasim
 
Overview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostOverview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostTakami Sato
 
DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI...
 DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI... DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI...
DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI...Nexgen Technology
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Kazi Toufiq Wadud
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methodsReza Ramezani
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningShubhmay Potdar
 
Data Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image ProcessingData Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image ProcessingDerek Kane
 

What's hot (20)

Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Arithmetic Coding
Arithmetic CodingArithmetic Coding
Arithmetic Coding
 
Prediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptxPrediction of heart disease using machine learning.pptx
Prediction of heart disease using machine learning.pptx
 
Clustering
ClusteringClustering
Clustering
 
Learning from imbalanced data
Learning from imbalanced data Learning from imbalanced data
Learning from imbalanced data
 
Fuzzy c-means clustering for image segmentation
Fuzzy c-means  clustering for image segmentationFuzzy c-means  clustering for image segmentation
Fuzzy c-means clustering for image segmentation
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
 
Object detection
Object detectionObject detection
Object detection
 
Svm V SVC
Svm V SVCSvm V SVC
Svm V SVC
 
Breast Cancer Detection with Convolutional Neural Networks (CNN)
Breast Cancer Detection with Convolutional Neural Networks (CNN)Breast Cancer Detection with Convolutional Neural Networks (CNN)
Breast Cancer Detection with Convolutional Neural Networks (CNN)
 
Neural Network Based Brain Tumor Detection using MR Images
Neural Network Based Brain Tumor Detection using MR ImagesNeural Network Based Brain Tumor Detection using MR Images
Neural Network Based Brain Tumor Detection using MR Images
 
Brain tumor detection using convolutional neural network
Brain tumor detection using convolutional neural network Brain tumor detection using convolutional neural network
Brain tumor detection using convolutional neural network
 
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...
Breast Cancer Diagnosis using a Hybrid Genetic Algorithm for Feature Selectio...
 
Overview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostOverview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboost
 
DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI...
 DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI... DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI...
DISEASE PREDICTION BY MACHINE LEARNING OVER BIG DATA FROM HEALTHCARE COMMUNI...
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methods
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
 
Data Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image ProcessingData Science - Part XVII - Deep Learning & Image Processing
Data Science - Part XVII - Deep Learning & Image Processing
 
Final ppt
Final pptFinal ppt
Final ppt
 

Similar to Wisconsin Breast Cancer Dataset Analysis

Seminar Slides
Seminar SlidesSeminar Slides
Seminar Slidespannicle
 
Summer 2015 Internship
Summer 2015 InternshipSummer 2015 Internship
Summer 2015 InternshipTaylor Martell
 
Quantitative Cancer Image Analysis
Quantitative Cancer Image AnalysisQuantitative Cancer Image Analysis
Quantitative Cancer Image AnalysisWookjin Choi
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningKuppusamy P
 
Feature selection with imbalanced data in agriculture
Feature selection with  imbalanced data in agricultureFeature selection with  imbalanced data in agriculture
Feature selection with imbalanced data in agricultureAboul Ella Hassanien
 
Artificial Intelligence in Radiation Oncology.pptx
 Artificial Intelligence in Radiation Oncology.pptx Artificial Intelligence in Radiation Oncology.pptx
Artificial Intelligence in Radiation Oncology.pptxWookjin Choi
 
Kaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyKaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyAlon Bochman, CFA
 
County-Level Corn Yield Prediction with GeoAI.pdf
County-Level Corn Yield Prediction with GeoAI.pdfCounty-Level Corn Yield Prediction with GeoAI.pdf
County-Level Corn Yield Prediction with GeoAI.pdfDavidataLiao
 
Introduction to sampling
Introduction to samplingIntroduction to sampling
Introduction to samplingSituo Liu
 
Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...Lionel Briand
 
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...Joel Saltz
 
Anomaly detection Workshop slides
Anomaly detection Workshop slidesAnomaly detection Workshop slides
Anomaly detection Workshop slidesQuantUniversity
 
Artificial Intelligence in Radiation Oncology
Artificial Intelligence in Radiation OncologyArtificial Intelligence in Radiation Oncology
Artificial Intelligence in Radiation OncologyWookjin Choi
 
Factor Analysis for Exploratory Studies
Factor Analysis for Exploratory StudiesFactor Analysis for Exploratory Studies
Factor Analysis for Exploratory StudiesManohar Pahan
 

Similar to Wisconsin Breast Cancer Dataset Analysis (20)

Seminar Slides
Seminar SlidesSeminar Slides
Seminar Slides
 
Summer 2015 Internship
Summer 2015 InternshipSummer 2015 Internship
Summer 2015 Internship
 
Vanderbilt b
Vanderbilt bVanderbilt b
Vanderbilt b
 
Quantitative Cancer Image Analysis
Quantitative Cancer Image AnalysisQuantitative Cancer Image Analysis
Quantitative Cancer Image Analysis
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
 
CAD v2
CAD v2CAD v2
CAD v2
 
Feature selection with imbalanced data in agriculture
Feature selection with  imbalanced data in agricultureFeature selection with  imbalanced data in agriculture
Feature selection with imbalanced data in agriculture
 
Artificial Intelligence in Radiation Oncology.pptx
 Artificial Intelligence in Radiation Oncology.pptx Artificial Intelligence in Radiation Oncology.pptx
Artificial Intelligence in Radiation Oncology.pptx
 
Seminar nov2017
Seminar nov2017Seminar nov2017
Seminar nov2017
 
Kaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyKaggle Gold Medal Case Study
Kaggle Gold Medal Case Study
 
Where do we currently stand at ICARDA?
Where do we currently stand at ICARDA?Where do we currently stand at ICARDA?
Where do we currently stand at ICARDA?
 
County-Level Corn Yield Prediction with GeoAI.pdf
County-Level Corn Yield Prediction with GeoAI.pdfCounty-Level Corn Yield Prediction with GeoAI.pdf
County-Level Corn Yield Prediction with GeoAI.pdf
 
Introduction to sampling
Introduction to samplingIntroduction to sampling
Introduction to sampling
 
Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...
 
Linear regression analysis
Linear regression analysisLinear regression analysis
Linear regression analysis
 
TESCO Evaluation of Non-Normal Meter Data
TESCO Evaluation of Non-Normal Meter DataTESCO Evaluation of Non-Normal Meter Data
TESCO Evaluation of Non-Normal Meter Data
 
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...Tools to Analyze Morphology and Spatially Mapped Molecular Data -  Informatio...
Tools to Analyze Morphology and Spatially Mapped Molecular Data - Informatio...
 
Anomaly detection Workshop slides
Anomaly detection Workshop slidesAnomaly detection Workshop slides
Anomaly detection Workshop slides
 
Artificial Intelligence in Radiation Oncology
Artificial Intelligence in Radiation OncologyArtificial Intelligence in Radiation Oncology
Artificial Intelligence in Radiation Oncology
 
Factor Analysis for Exploratory Studies
Factor Analysis for Exploratory StudiesFactor Analysis for Exploratory Studies
Factor Analysis for Exploratory Studies
 

Recently uploaded

Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 

Recently uploaded (20)

Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 

Wisconsin Breast Cancer Dataset Analysis

  • 1. Wisconsin Breast Cancer dataset GUGC Da Hee Kim Advised by Homin Park
  • 2. Contents • Introduction • Loading & checking the data • Explanatory Data analysis (EDA) • Feature Engineering • Modeling • Interpretability/ Explainable AI (XAI) • Wrap-up • Future research
  • 4. 1. Introduction - The problem Christina Applegate Sharon Osbourne Angelina Jolie
  • 5. 1. Introduction - The problem The 10 leading types of estimated new cancer cases and deaths in 2020. (South Korea) (A) Estimated new cases
  • 6. 1. Introduction - The problem The 10 leading types of estimated new cancer cases and deaths in 2020. (South Korea) (A) Estimated new cases (B) Estimated Deaths
  • 7. Cancer ≠Tumor: Abnormal growth of cells causing a mass of tissue • Malignant tumors are cancerous and invade other sites. • Benign tumors stay in their primary location. 1. Introduction - The problem
  • 8. 1. Introduction - The problem: Diagnosis
  • 9. 1. Introduction - The problem Benign Malignant • Nucleus size uniform • Symmetrical • Homogenous • Areas within normal size • Non uniform nucleus • Asymmetrical • Non homogenous sizes • Areas above normal size
  • 10. 1. Introduction - The problem: Diagnosis Problem:  Depending on type, painful to patient  Potential side effects (ex: bruising)  Diagnosis can take time  Tedious process Model Malignant Benign Machine learning Imaging
  • 11. 1. Wisconsin Breast Cancer dataset - What? • Has parameters measured form a fine needle aspirate of a breast mass. • The parameters are about the cell nucleus. (569 cells) Cell Nuclei
  • 12. 1. Wisconsin Breast Cancer dataset - What? • The parameters include the following 10 features: • Radius • Texture • Perimeter • Area • Smoothness • Compactness (perimeter^2 / area - 1.0) • Concavity (severity of concave portions of the contour) • Concave points (number of concave portions of the contour) • Symmetry • Fractal dimension
  • 13. 1. Wisconsin Breast Cancer dataset - What? • The parameters include the following 10 features: • Texture: standard deviation of gray-scale values • Smoothness: local variation in radius lengths • Compactness: (perimeter^2 / area - 1.0) • Concavity: severity of concave portions of the contour • Concave points: number of concave portions of the contour • Symmetry: uses a nucleus deformation parameter to measure how non-spherical a nucleus is • Fractal dimension: ("coastline approximation" - 1) Radius Perimeter Area Nuclei
  • 14. 1. Wisconsin Breast Cancer dataset - What? : Structure of dataset • Each of the 10 features is measured three ways (mean, standard error, worst), giving 30 feature columns • Together with ID, diagnosis, and an empty Unnamed column, the dataset has 33 columns
  • 15. • Doctors will be able to determine whether a tumor is malignant or benign through imaging, without biopsy. • Breakthroughs with breast cancer can act as a stepping stone for other cancers where biopsy is difficult to conduct. 1. Wisconsin Breast Cancer dataset - Importance? Model Malignant Benign
  • 17. 1. Loading and checking data 1: Loading data 2: Checking for null values 3. Outlier detection 4. Summary and statistics
  • 18. 1. Loading and checking the data
  • 19. 1. Loading and checking the data ID & Diagnosis Mean Standard error Worst Unnamed
  • 20. 1. Loading and checking the data - 2: Checking for null values • All the data in the Unnamed column consists of null values • We will thus remove this column later
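The null-value check and column drop described on this slide might be sketched in pandas as follows (the `Unnamed: 32` column name is assumed from the common Kaggle CSV layout of this dataset; the toy frame is hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical frame mimicking the Kaggle CSV: a trailing all-NaN column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

null_counts = df.isnull().sum()                      # per-column null counts
all_null = null_counts[null_counts == len(df)].index  # fully-null columns
df = df.drop(columns=all_null)                        # drop them
```

After this, only `id`, `diagnosis`, and the feature columns remain.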
  • 21. 1. Loading and checking the data - 3: Outlier Detection • We will drop these later • Redefine “X” which includes only the features • Output: Outliers found depending on only the feature traits (x_col)
  • 22. 1. Loading and checking the data - 4: Summary and statistics • We can observe the statistical values for each of the features • Redefine “X” which includes only the features • Output
  • 23. 1. Loading and checking the data - 4: Summary and statistics • We can observe the quantity of each benign and malignant tumors • Redefine “data_w_diag” which includes the diagnosis and the 30 features • Output Number of Benign: 357 Number of Malignant: 212
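scikit-learn ships this same dataset via `load_breast_cancer` (target 0 = malignant, 1 = benign), so the class counts on the slide can be reproduced directly:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
diagnosis = pd.Series(data.target).map({0: "M", 1: "B"})
counts = diagnosis.value_counts()
print(f"Number of Benign: {counts['B']}")     # 357
print(f"Number of Malignant: {counts['M']}")  # 212
```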
  • 24. 2. Exploratory Data Analysis 1: Heat Map all features 2: Important features 2-1: Radius VS Perimeter VS Area 1: Heat map 2-2: Compactness VS Concavity VS Concave points 1: Heat map 2: Feature plotting: Histogram 3: Overall data distribution
  • 25. 2. EDA - 1: Heat Map We can see a couple of relations using the heat map. 1. Within the mean or worst features, we can see that radius is highly correlated with the perimeter and the area. 2. Compactness, concavity and concave points are correlated with each other.
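The relations the heat map reveals can be checked numerically. A sketch, assuming sklearn's bundled copy of the dataset (`sns.heatmap(corr)` would then draw the clustered map shown on the slide):

```python
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data
corr = X.corr()

# The clusters visible in the heat map show up as high pairwise correlations:
print(corr.loc["mean radius", "mean perimeter"])          # close to 1.0
print(corr.loc["mean concavity", "mean concave points"])  # also high
```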
  • 26. 2. EDA - 2-1: Radius VS Perimeter VS Area 1. Heat map • Different colors at 1.0 are due to rounding of the correlation values.
  • 27. 2. EDA - 2-2: Compactness VS Concave points VS Concavity 1. Heat map We will compare the following features • Compactness & concavity & concave points → high correlation • Compactness mean VS compactness worst • Concavity mean VS concavity worst • Concave points mean VS concave points worst
  • 28. 2. EDA -2-2: Compactness VS Concave points VS Concavity 2. Feature plotting: Worst VS Mean • Concave points • Concavity • Compactness • concavity & compactness: worst ≒ mean • Concave points: worst ≠ mean Similarity of overall distribution
  • 29. 2. EDA - 3: Data distribution • Violin Plot • Worst • Mean • Standard Error
  • 30. 2. EDA - 3: Data distribution • Violin Plot • Worst • Red box: Examples of features with good separation • Blue box: Examples of features with bad separation Assume that features with good separation will have higher feature importance
  • 31. 3. Feature Engineering 1: Standardization 2: Outlier detection
  • 32. 3. Feature engineering - 1. Standardization Before After
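The standardization step shown in the before/after plots is a z-score transform; with scikit-learn's `StandardScaler` it might look like this (toy matrix for illustration):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Two features on very different scales, as in the raw dataset.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# After scaling, every column has zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
```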
  • 33. 3. Feature engineering - 2. Outlier Deletion: Swarm Plot Before After
  • 34. 4. Modeling 1: Splitting data 2: Classification 2-1: ANN 2-2: SVC vs Decision Tree vs Ada Boost vs Random forest vs Extra trees vs GBC vs Logistic regression 3: Cross validate models 4: Hyper parameter tuning 5: Evaluating models
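Step 1, splitting the data: a stratified split at the 70/30 ratio mentioned in the speaker notes might be done as follows (the `random_state` is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the benign/malignant ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```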
  • 36. 4. Modeling - 2: Classification • ANN Parameters • Hidden layers: 3 • Optimizer: Adam • Learning rate: 0.003 • Epochs: 200
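The slide's network was presumably built in a deep learning framework; an equivalent sketch with scikit-learn's `MLPClassifier` is below. Only the slide's settings (3 hidden layers, Adam, learning rate 0.003, up to 200 epochs) are taken from the source; the layer widths (32, 16, 8) are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Three hidden layers (widths assumed), Adam optimizer, lr 0.003, 200 epochs max.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16, 8), solver="adam",
                  learning_rate_init=0.003, max_iter=200, random_state=0),
)
ann.fit(X_tr, y_tr)
acc = ann.score(X_te, y_te)
```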
  • 37. 4. Modeling -3: Cross validate models • SVC VS Decision Tree VS Ada Boost VS Random forest VS Extra trees VS GBC VS Logistic regression
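The cross-validation described in the speaker notes (stratified k-fold over the candidate models) can be sketched like this; only three of the seven models are shown here for brevity:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Stratified folds keep the benign/malignant ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random forest": RandomForestClassifier(random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=cv).mean()
          for name, m in models.items()}
```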
  • 38. 4. Modeling -4: Hyperparameter tuning for the selected models • SVC • Extra trees • GBC • Logistic regression
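The slide does not show the parameter grids that were searched; a hypothetical grid search for the SVC (the `C`/`gamma` values below are illustrative) might look like:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = {"svc__C": [0.1, 1, 10],
        "svc__gamma": ["scale", 0.01, 0.001]}  # illustrative values

# 5-fold CV over every C/gamma combination; refits on the best one.
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
best = search.best_params_
```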
  • 39. 4. Modeling - 5: Evaluating models

    Model                          Accuracy  Precision  Recall  F1-score
    ANN                            0.978     0.950      0.974   0.962
    SVC classifier                 0.992     0.974      1.000   0.987
    Extra Trees classifier         0.978     0.974      0.950   0.962
    Gradient boosting classifier   0.957     0.923      0.923   0.923
    Logistic regression            0.985     0.974      0.974   0.974

    • Sensitivity (recall) is the most important metric → fewer cases of diagnosing a cancer patient as negative. • SVC has the highest value for every metric → SVC is effective for data with clear class separation; the dataset may be too small for the ANN to achieve the best results. • All models score above 0.9 → this suggests there exist features that allow good classification.
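The four metrics in the table can be computed with scikit-learn; the toy labels below (1 = malignant) are only for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # toy ground truth
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]  # toy predictions: 1 FN, 1 FP

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # of predicted positives, fraction correct
rec = recall_score(y_true, y_pred)      # sensitivity: fraction of positives caught
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```

Recall is the one to maximize here, since a false negative means a cancer patient is told they are healthy.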
  • 40. 5. Explainable AI 1: SHAP 1-1: Summary plot 2: Permutation Feature Importance 3: Comparison (SHAP VS PFI)
  • 41. 5. Explainable AI - 1: SHAP 1. Summary plot
  • 42. 5. Explainable AI - 1: SHAP 1. Summary plot (continued) • concave points_worst has the highest importance • Features that had high correlation in EDA • Concave points vs concavity vs compactness → concave points ≠ concavity ≠ compactness • Area vs perimeter vs radius → area_worst ≒ radius_worst ≠ perimeter_worst → area_mean ≠ radius_mean ≒ perimeter_mean
  • 43. 5. Explainable AI - 2: Permutation feature importance
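Permutation feature importance is available in scikit-learn directly; a sketch against an rbf-kernel SVC like the one used above (the split and `n_repeats` settings are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3,
    stratify=data.target, random_state=0
)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)

# Accuracy drop when a feature is shuffled = that feature's importance.
result = permutation_importance(model, X_te, y_te,
                                n_repeats=10, random_state=0)
top5 = sorted(zip(data.feature_names, result.importances_mean),
              key=lambda t: -t[1])[:5]
```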
  • 44. 5. Explainable AI - 3: Comparing Output SHAP VS PFI • Interactions: PFI calculates the decrease in model performance when a feature is permuted, while SHAP values account for interactions between features and capture non-linear relationships. → For complex non-linear models, SHAP may provide a better ranking. o We used 'rbf' as the kernel function; SHAP can handle the non-linearities. → PFI may not fully capture the non-linearities. → Some features showed high interactions; PFI does not consider this. o This may cause differences in importance. • Distribution differences: SHAP considers the distribution of the entire dataset, while PFI focuses on the effect of permuting a feature individually.
  • 45. Wrap-up • We can determine whether a tumor is benign or malignant using the SVC classifier with high accuracy. Implementation: no biopsy would be needed for diagnosis.
  • 46. Future work • Methods for scaling the imaging pipeline would be needed. • There are other cancers, such as kidney cancer, in which biopsy is much more difficult. The breast cancer model could serve as a stepping stone for advancements in those areas. o No biopsy would be needed, and diagnosis would be less tedious and painful for patients. o Human labor can be reduced.

Editor's Notes

  1. But actually, many other celebrities went through this process after finding out they had the mutated version of the gene. So, just by looking at this, we can see that breast cancer is very common. This can be shown statistically in the following diagram.
  2. https://doi.org/10.4143/crt.2020.203 This diagram shows the types of cancer that females were diagnosed with in South Korea in 2020. We can see that 24.7% is breast cancer. This means that if there are about 10 people in this classroom, assuming we are all female, 2-3 of us will get breast cancer.
  3. Now let's look at the mortality rate of each cancer. We can see that although breast cancer was 1st for diagnoses, it was not for deaths. Why? Although this may be because breast cancer is easier to diagnose than, for instance, lung cancer, it is also because once diagnosed, the treatment is quite well developed for some types. Thus we can see two things here: first, breast cancer is common in women, and second, diagnosis is essential when treating the disease. Then how are these tumors diagnosed?
  4. Then how is a tumor diagnosed as cancerous or non-cancerous? In most cases, tumors are self-diagnosed or found through annual checkups. This is because breast tumors can be self-diagnosed by hand. Thus, the important part of diagnosis is checking whether the tumor is benign or malignant. As you can see, benign tumors stay in their primary location, whereas malignant tumors invade other areas. When explaining the dataset, also explain the features. => Wisconsin website
  5. This process is traditionally done through biopsy. We first take a piece of the tumor tissue in the breast and diagnose by looking at the sample under the microscope. Breast cancer diagnosis and development process => benign-to-malignant graphical representation https://www.mypathologyreport.ca/ko/pathology-dictionary/biopsy/ https://doi.org/10.1016/j.jmoldx.2021.01.006
  6. This is an example of what can be seen under the microscope. As you can see, benign tumors are relatively round and symmetric in shape, whereas malignant tumors are irregular and have a large nucleus. The diagnosis is done by eye due to these clear distinguishing features. Breast cancer diagnosis and development process => benign-to-malignant graphical representation https://doi.org/10.1038/s41598-022-19278-2 Show with arrows with diff col (symm, normal,
  7. So what is the problem with this type of diagnosis? Why do we need machine learning classification methods? Biopsy has a couple of drawbacks: it is painful to the patient and is a tedious process. If we use machine learning, we will be able to diagnose malignant tumors just by getting features from screening, thus without biopsy. Breast cancer diagnosis: invasive method. Biopsy; development process => benign-to-malignant graphical representation https://www.mypathologyreport.ca/ko/pathology-dictionary/biopsy/ https://doi.org/10.1016/j.jmoldx.2021.01.006
  8. Then what data do we use to train this model? We are going to use the Wisconsin breast cancer dataset. This has information about the nuclei of the cells, which is obtained by biopsy. When explaining the dataset, also explain the features. Wisconsin website Add pg 15 immunohistochemistry => how was the data made? Add images with malignant and benign tumors => can see how the answers of the dataset were made. Some cancers are difficult to biopsy. => image-based diagnosis helpful IMPORTANCE => non-invasive method.
  9. The data includes the following features. I will briefly look at each one.
  10. Let's assume this circle is the nucleus of a cell. The radius will be this part; the perimeter and the area will be here. The texture is the standard deviation of grey-scale values. This indicates a score of the grey and white areas of the cell. The smoothness values are local variations in radius lengths. The compactness is calculated from the perimeter and the area. The concavity is the severity of the concave portions of the contour. The concave points value is the number of such points. Symmetry values. Finally, fractal dimension is found with the coastline approximation.
  11. Let's assume the shape on the left is a tumor nucleus. We see that the outside is uneven. Here we can measure the mean radius, and also the worst radius. We can then calculate the standard error using the equation above. Therefore, there are three measurements for a particular feature. This type of measurement is done for all features.
  12. As you can see here, since there are 10 distinct features, we end up with 30 different measurements. There are also the ID and the diagnosis, so we end up with 32 columns.
  13. Therefore, we will be able to diagnose cancer by obtaining different morphological characteristics through imaging.
  14. Now we will go through the machine learning process.
  15. First, we will load and check the data.
  16. We first check the data. As stated before, there is the ID, diagnosis, the 30 features, and the Unnamed column.
  17. WAIT
  18. We then check for null values. We see that the Unnamed column consists of only null values. We will remove it later.
  19. Then for the outlier detection. We will drop them later on.
  20. We also check the summary and the statistics of the numerical data.
  21. Finally, we see the number of malignant and benign subjects.
  22. Next is exploratory data analysis.
  23. This is a clustered heat map. We can observe two main things. First, we see that the area, perimeter and radius are highly correlated. Also, we can see that compactness, concave points and concavity are related to each other. Since there are too many features, we will look at separate ones for each case.
  24. The SE (standard error) features can be removed. => when doing the coefficient …
  26. Histogram => y-axis has no meaning. Here, we see plots of each mean and worst. The diagonal is a histogram. The others are scatter plots which show the relation between two different features.
  30. Intuitively: Tumor burden & malignancy => area is real area found in image
  31. Next is a heat map of the compactness, concavity and concave points. The slots have values greater than 0.9; we will visualize the correlation of the three features.
  33. Here, we have graphed the points against each other. The pearsonr value (Pearson correlation) is the correlation value. We can see that it ranges from 0.86 to 0.92, so they have a high correlation. The 0.92 value between concavity and concave points is because they are defined in relation to one another.
  35. Next we will look at the distribution of the features. We will look at the worst features.
  37. Here we can see that although there are features with a clear distinction, like those in red, others, like those in the blue box, show less separation.
  38. Now we will look at feature engineering.
  39. Next we standardize the data. This is done using the standard scaler and is done to unify the units. Z-score transformation to match the units => so the model is not biased.
  40. We first delete the outliers detected in the first step.
  41. Next is the modeling step. We are going to use 8 different models: ANN, SVC, decision tree, AdaBoost, random forest, extra trees, gradient boosting, and logistic regression.
  43. Since our data does not have separate testing and training sets, we need to separate them. We do this with a 70 to 30 ratio.
  44. Now for the ANN modeling. We optimized the parameters as seen here. There are 4 hidden layers, the learning rate is 0.001 and the epoch count is 200. After trials, I found these to give the best results.
  45. In order to find models to compare with, I cross-validated 7 different models. The cross-validation score indicates how well the model will evaluate unseen data. Here we used stratified k-fold cross-validation. This ensures that each fold has approximately the same distribution of target classes as the dataset. I took the top 4 models: SVC, extra trees, gradient boosting classifier and logistic regression.
  52. We can see that the SVC classifier has the highest values. Some potential reasons are the following: due to its working mechanism, SVC is effective for data with clear class separation. Also, the dataset may be relatively small for the ANN to achieve the best results.
  53. Finally we will look at explainable AI methods, in particular SHAP and permutation feature importance.
  54. This is the force plot with samples ordered by similarity. The red features contribute to classifying as malignant, and the blue ones contribute to classifying as benign.
  56. We will look at two different examples. The one on the top is about number
  57. Medical interpretation => why area_worst and radius_worst are important in determining whether M or B
  59. Overall summary of what was done (1-2 sentences) + what could not be covered here + future work: points for improvement + what could be done next