This document describes using machine learning models to classify breast cancer tumors as benign or malignant from cell nucleus characteristics, without requiring a biopsy. It covers loading and preprocessing the Wisconsin Breast Cancer dataset, performing exploratory data analysis to identify important features, engineering features, training classifiers including an SVC, and evaluating the models. SHAP and permutation feature importance analysis identified concave point characteristics as the most important for classification. The top-performing SVC classifier achieved over 99% accuracy, suggesting diagnosis may be possible without biopsy. Future work could apply these methods to other cancers where biopsy is difficult.
6. 1. Introduction
- The problem
The 10 leading types of estimated new cancer cases and deaths in 2020. (South Korea)
(A) Estimated new cases (B) Estimated Deaths
7. 1. Introduction
- The problem
Cancer ≠ Tumor: a tumor is an abnormal growth of cells causing a mass of tissue.
• Malignant tumors are cancerous and invade other sites.
• Benign tumors stay in their primary location.
9. 1. Introduction
- The problem
Benign:
• Uniform nucleus size
• Symmetrical
• Homogeneous
• Areas within normal size
Malignant:
• Non-uniform nucleus
• Asymmetrical
• Non-homogeneous sizes
• Areas above normal size
10. 1. Introduction
- The problem: Diagnosis
Problems with biopsy:
• Depending on type, painful to the patient
• Potential side effects (e.g., bruising)
• Diagnosis can take time
• Tedious process
[Diagram: imaging features → machine learning model → malignant / benign]
11. 1. Wisconsin Breast Cancer dataset
- What?
• Contains parameters measured from a fine needle aspirate of a breast mass.
• The parameters describe the cell nuclei (569 samples).
[Image: cell nuclei]
12. 1. Wisconsin Breast Cancer dataset
- What?
• The parameters include the following 10 features
•Radius
•Texture
•Perimeter
•Area
•Smoothness
•Compactness (perimeter^2 / area - 1.0)
•Concavity (severity of concave portions of the contour)
•Concave points (number of concave portions of the contour)
•Symmetry
•Fractal dimension
13. 1. Wisconsin Breast Cancer dataset
- What?
• The parameters include the following 10 features
•Texture: standard deviation of gray-scale values
•Smoothness: local variation in radius lengths
•Compactness: (perimeter^2 / area - 1.0)
•Concavity: severity of concave portions of the contour
•Concave points: Number of concave portions of the contour
•Symmetry: Uses nucleus deformation parameter to measure
how non-spherical a nucleus is.
• Fractal dimension: ("coastline approximation" - 1)
[Diagram: nucleus with radius, perimeter, and area annotated]
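The compactness formula on this slide can be checked with a quick computation. For a perfect circle, perimeter² / area − 1.0 reduces to 4π − 1, the minimum possible value, so larger compactness means a less circular, more irregular nucleus outline. The numbers below are illustrative, not from the dataset:

```python
import math

# Compactness as defined on the slide: perimeter^2 / area - 1.0
def compactness(perimeter: float, area: float) -> float:
    return perimeter ** 2 / area - 1.0

r = 3.0
# A perfect circle: perimeter = 2*pi*r, area = pi*r^2 -> 4*pi - 1 (~11.57)
circle = compactness(2 * math.pi * r, math.pi * r ** 2)
# An irregular outline with a longer perimeter for similar area scores higher
irregular = compactness(25.0, 20.0)
```

This is why compactness (like concavity) acts as a shape-irregularity score for the nucleus contour.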
14. 1. Wisconsin Breast Cancer dataset
- What? : Structure of the dataset
• Each of the 10 features is recorded three ways: mean, standard error, and worst → 30 feature columns.
• Plus ID, diagnosis, and an empty Unnamed column → 33 columns in total.
15. 1. Wisconsin Breast Cancer dataset
- Importance?
• Doctors would be able to determine whether a tumor is malignant or benign through imaging, without biopsy.
• Breakthroughs with breast cancer can act as a stepping stone for other cancers where biopsy is difficult to conduct.
[Diagram: model → malignant / benign]
19. 1. Loading and checking the data
ID & Diagnosis
Mean
Standard error
Worst
Unnamed
20. 1. Loading and checking the data
- 2: Checking for null values
• The Unnamed column consists entirely of null values.
• We will thus remove this column later.
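A minimal sketch of this null check in pandas. The tiny DataFrame below stands in for the real CSV; its column names mimic the Wisconsin Breast Cancer file, including the trailing empty "Unnamed: 32" column:

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv("data.csv"); values here are illustrative.
df = pd.DataFrame({
    "id": [8510426, 8510653, 8510824],
    "diagnosis": ["B", "B", "M"],
    "radius_mean": [13.54, 13.08, 17.99],
    "Unnamed: 32": [np.nan, np.nan, np.nan],  # trailing empty column in the CSV
})

null_counts = df.isnull().sum()                         # nulls per column
all_null = null_counts[null_counts == len(df)].index    # columns that are entirely null
df = df.drop(columns=all_null)                          # drop them
```

After the drop, only the ID, diagnosis, and feature columns remain.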
21. 1. Loading and checking the data
- 3: Outlier Detection
• Redefine "X" to include only the features.
• Output: outliers found based on the feature values alone (x_col).
• We will drop these rows later.
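The deck does not state which outlier rule was used, so the sketch below assumes the common 1.5×IQR rule applied per feature column; the data values are made up for illustration:

```python
import pandas as pd

# Illustrative feature column; 28.1 is the planted outlier.
X = pd.DataFrame({"radius_mean": [12.0, 13.1, 12.8, 13.4, 28.1, 12.5]})

def iqr_outliers(col: pd.Series) -> pd.Series:
    """Boolean mask marking values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

mask = iqr_outliers(X["radius_mean"])
outlier_rows = X.index[mask].tolist()   # row indices to drop later
```

The flagged row indices would then be dropped from both X and the diagnosis labels in the feature-engineering step.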
22. 1. Loading and checking the data
- 4: Summary and statistics
• We can observe the statistical values for each feature.
• Redefine "X" to include only the features.
• Output
23. 1. Loading and checking the data
- 4: Summary and statistics
• We can observe the number of benign and malignant tumors.
• Redefine "data_w_diag" to include the diagnosis and the 30 features.
• Output
Number of Benign: 357
Number of Malignant: 212
24. 2. Exploratory Data Analysis
1: Heat Map all features
2: Important features
2-1: Radius VS Perimeter VS Area
1: Heat map
2-2: Compactness VS Concavity VS Concave points
1: Heat map
2: Feature plotting: Histogram
3: Overall data distribution
25. 2. EDA
- 1: Heat Map
The heat map reveals a couple of relationships:
1. Within the mean and worst features, radius is highly correlated with perimeter and area.
2. Compactness, concavity, and concave points are correlated with each other.
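The correlations behind the heat map can be reproduced numerically; as a convenience, the sketch below uses scikit-learn's bundled copy of the same Wisconsin dataset (feature names there are spelled like "mean radius" rather than "radius_mean"):

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset as a DataFrame and compute the feature correlation matrix,
# i.e. the numbers a heat map such as seaborn's would visualize.
data = load_breast_cancer(as_frame=True)
X = data.frame.drop(columns=["target"])

corr = X.corr()
r_radius_perim = corr.loc["mean radius", "mean perimeter"]        # ~1.0
r_compact_concave = corr.loc["mean compactness", "mean concave points"]
```

Both relationships from the slide show up: radius vs perimeter is nearly perfectly correlated, and compactness vs concave points is strongly correlated.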
26. 2. EDA
- 2-1: Radius VS Perimeter VS Area
1. Heat map
• Cells showing 1.0 in different colors: the displayed values are rounded, so the underlying correlations differ slightly.
27. 2. EDA
-2-2: Compactness VS Concave points VS Concavity
1. Heat map
We will compare the following features:
• Compactness & concavity & concave points (high correlation)
• Compactness mean VS compactness worst
• Concavity mean VS concavity worst
• Concave points mean VS concave points worst
28. 2. EDA
-2-2: Compactness VS Concave points VS Concavity
2. Feature plotting: Worst VS Mean (concave points, concavity, compactness)
• Concavity & compactness: worst ≒ mean (similar overall distribution)
• Concave points: worst ≠ mean
29. 2. EDA
- 3: Data distribution
• Violin Plot
• Worst
• Mean
• Standard Error
30. 2. EDA
- 3: Data distribution
• Violin Plot
• Worst
• Red box: examples of features with good separation
• Blue box: examples of features with bad separation
Assumption: features with good separation will have higher feature importance.
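The visual notion of "good separation" in the violin plots can be given a rough numeric stand-in: the standardized difference between class means for each feature. This is an illustrative metric, not the one used in the deck:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target           # target: 0 = malignant, 1 = benign

# |mean(malignant) - mean(benign)| in units of each feature's spread
sep = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)) / X.std(axis=0)

# Features ranked from best to worst separation
ranked = [data.feature_names[i] for i in np.argsort(sep)[::-1]]
```

Consistent with the assumption on the slide, concave-point features rank near the top by this measure, foreshadowing the SHAP results later.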
34. 4. Modeling
1: Splitting data
2: Classification
2-1: ANN
2-2: SVC vs Decision Tree vs Ada Boost vs Random forest vs Extra trees vs GBC vs Logistic regression
3: Cross validate models
4: Hyperparameter tuning
5: Evaluating models
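The split-scale-fit sequence outlined above can be sketched with scikit-learn. The 70/30 split follows the speaker notes and the RBF kernel is mentioned later in the deck; the random seed and other settings are assumptions for reproducibility:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 70/30 stratified split so both sets keep the benign/malignant ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Standardize inside a pipeline so the scaler is fit on training data only
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)   # held-out accuracy
```

Wrapping the scaler and classifier in one pipeline avoids leaking test-set statistics into the standardization step.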
39. Model comparison

Model                         Accuracy  Precision  Recall  F-1 Score
ANN                           0.978     0.950      0.974   0.962
SVC classifier                0.992     0.974      1.000   0.987
Extra Trees classifier        0.978     0.974      0.950   0.962
Gradient boosting classifier  0.957     0.923      0.923   0.923
Logistic regression           0.985     0.974      0.974   0.974
4. Modeling
-5: Evaluating models
• Sensitivity (recall) is the most important metric: higher recall means fewer cancer patients misclassified as negative.
• SVC has the highest value for all metrics. SVC is effective for data with clear class separation; the dataset may be too small for the ANN to achieve the best results.
• All models score above 0.9, suggesting that some features allow for good classification.
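The four metrics in the table are standard scikit-learn scores. The tiny label vectors below are made up for illustration, with 1 treated as the malignant (positive) class:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # illustrative ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # illustrative model predictions

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted malignant, how many truly are
rec = recall_score(y_true, y_pred)      # of true malignant, how many were found
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```

Recall is the metric the slide singles out, since a false negative (a missed malignant tumor) is the costliest error here.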
40. 5. Explainable AI
1: SHAP
1-1: Summary plot
2: Permutation Feature Importance
3: Comparison (SHAP VS PFI)
42. 5. Explainable AI
- 1: SHAP
3. Summary plot
• concave points_worst has the highest importance.
• Features that had high correlation in the EDA:
  • Concave points vs concavity vs compactness: concave points ≠ concavity ≠ compactness
  • Area vs perimeter vs radius:
    area_worst ≒ radius_worst ≠ perimeter_worst
    area_mean ≠ radius_mean ≒ perimeter_mean
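To illustrate what the SHAP summary plot aggregates, here is a hand-rolled computation of exact Shapley values for a tiny 3-feature linear model: a feature's value is its marginal contribution averaged over all feature orderings, with "absent" features replaced by their background mean. The model, weights, and data here are all illustrative, not the deck's SVC:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))    # background data defining the baseline
mu = background.mean(axis=0)

w = np.array([2.0, -1.0, 0.5])
def f(z):                                # the "model": a simple linear function
    return float(z @ w)

x = np.array([1.0, 0.5, -2.0])           # the instance to explain

def shapley(i, n=3):
    total = 0.0
    for perm in itertools.permutations(range(n)):
        present = set(perm[:perm.index(i)])  # features revealed before i
        z = np.array([x[j] if j in present else mu[j] for j in range(n)])
        z_with = z.copy()
        z_with[i] = x[i]
        total += f(z_with) - f(z)            # marginal contribution of i
    return total / math.factorial(n)

phi = np.array([shapley(i) for i in range(3)])
# Additivity: the values sum to f(x) minus the baseline prediction f(mu)
```

In practice the shap library approximates this for models like an RBF SVC, where the exact enumeration would be intractable; the summary plot then shows the distribution of these per-sample values for each feature.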
44. 5. Explainable AI
- 3: Comparing Output
SHAP VS PFI
• Interactions: PFI measures the decrease in model performance when a feature is permuted, while SHAP values account for interactions between features and capture non-linear relationships.
  For complex non-linear models, SHAP may provide a better ranking.
  o We used 'rbf' as the kernel function. SHAP can handle the non-linearities; PFI may not fully capture them.
  o Some features showed high interactions; PFI does not consider this, which may cause differences in importance.
• Distribution differences: SHAP considers the distribution of the entire dataset, while PFI focuses on the effect of permuting a feature individually.
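PFI as described above is available directly in scikit-learn: permute one feature's column, re-score, and record the drop. Fitting and scoring on the full data keeps this sketch short; the deck's analysis would use a held-out split, and the settings below are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(data.data, data.target)

# For each feature: shuffle its column n_repeats times and average the
# resulting drop in accuracy.
result = permutation_importance(model, data.data, data.target,
                                n_repeats=5, random_state=0)

order = np.argsort(result.importances_mean)[::-1]
top5 = [data.feature_names[i] for i in order[:5]]   # most important features
```

Because each feature is shuffled on its own, correlated features (radius, perimeter, area) can "cover" for one another and each look less important than SHAP suggests, which is exactly the discrepancy the slide discusses.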
45. Wrap-up
• We can determine whether a tumor is benign or malignant using an SVC classifier with high accuracy.
Implication: no biopsy would be needed for diagnosis.
46. Future work
• Methods to scale the imaging pipeline would be needed.
• There are other cancers, such as kidney cancer, where biopsy is much more difficult. The breast cancer model could serve as a stepping stone for advancements in those areas.
  o No biopsy would be needed, and diagnosis would be less tedious and painful for patients.
  o Human labor can be reduced.
Editor's Notes
But actually, many other celebrities went through this process after finding out they had the mutated version of the gene.
So, just by looking at this, we can see that breast cancer is very common.
This can be proved statistically by looking at the following diagram.
https://doi.org/10.4143/crt.2020.203
This is a diagram that shows the types of cancer that females were diagnosed with in South Korea in 2020.
We can see that 24.7% is breast cancer. This means that since there are about 10 people in this classroom, assuming we are all female, 2-3 people will get breast cancer.
Now let's look at the mortality rate of each cancer. We can see that although breast cancer was first for diagnoses, it was not for deaths. Then why?
Although this may be due to the fact that diagnosis of breast cancer is easier than for instance lung cancer, it is also because once diagnosed, the treatment is quite well developed for some types.
Thus we can see two things here. First is that breast cancer is common in women, and second, that diagnosis is essential when treating the disease.
Then how are these tumors diagnosed?
How is a tumor diagnosed as cancerous or non-cancerous? In most cases, tumors are self-diagnosed or found through annual checkups, because breast tumors can be self-diagnosed by hand.
Thus, the important part of diagnosis is checking whether the tumor is benign or malignant. As you can see, benign tumors stay in their primary location, whereas malignant tumors invade other areas.
When explaining the dataset, also explain the features.
=> Wisconsin website
This process is done traditionally through biopsy.
We first take a piece of the tumor tissue in the breast and diagnose by looking at the sample under the microscope.
Breast cancer diagnosis
Development process => benign-to-malignant graphical representation
https://www.mypathologyreport.ca/ko/pathology-dictionary/biopsy/
https://doi.org/10.1016/j.jmoldx.2021.01.006
This is an example that can be seen under the microscope.
As you can see, benign tumors are relatively round and symmetric in shape, whereas malignant tumors are irregular and have a large nucleus.
The diagnosis is done by eye due to these clear distinguishing features.
https://doi.org/10.1038/s41598-022-19278-2
Show with arrows in different colors (symm, normal,
So then what is the problem with this type of diagnosis?
Why do we need machine learning classification methods?
Well, this is because biopsy has a couple of drawbacks: it is painful to the patient and is a tedious process.
If we use machine learning, we will be able to diagnose malignant tumors just from features obtained through screening, thus without biopsy.
Breast cancer diagnosis: invasive method (biopsy)
Development process => benign-to-malignant graphical representation
https://www.mypathologyreport.ca/ko/pathology-dictionary/biopsy/
https://doi.org/10.1016/j.jmoldx.2021.01.006
Then what data do we use to train this model?
We are going to use the Wisconsin Breast Cancer dataset.
This has information about the nuclei of the cells, which is obtained by biopsy.
When explaining the dataset, also explain the features.
Wisconsin website
Add pg 15 immunohistochemistry => how was the data made?
Add images with malignant and benign tumors => can see how the labels of the dataset were made.
There are cancers for which biopsy is difficult => image-based diagnosis would be helpful.
IMPORTANCE => non-invasive method.
The data includes the following features. I will go through each one briefly.
Let's assume this circle is the nucleus of a cell.
The radius will be this part; the perimeter and the area will be here.
The texture is the standard deviation of gray-scale values. This scores the gray and white areas of the cell.
The smoothness values are local variations in radius lengths.
The compactness is calculated from the perimeter and the area.
The concavity is the severity of the concave portions of the contour.
The concave points value is the number of such points.
Symmetry values.
Finally, the fractal dimension is found with the coastline approximation.
Let's assume the shape on the left is a tumor nucleus. We see that the outside is uneven. Here we can measure the mean radius and also the worst radius, and then calculate the standard error using the equation above. Therefore, there are three measurements for each feature.
This type of measurement is recorded for all features.
As you can see here, since there are 10 distinct features, we end up with 30 different measurements. With the ID and the diagnosis we end up with 32 columns (33 including the empty Unnamed column).
Therefore, we will be able to diagnose cancer by obtaining the different morphological characteristics through imaging.
Now we will look through the machine learning process.
First, we will load and check the data.
As stated before, there is the ID, the diagnosis, the 30 features, and the Unnamed column.
We then check for null values. We see that the Unnamed column consists of only null values; we will remove it later.
Then comes outlier detection; we will drop the outliers later on.
We also check the summary and statistics of the numerical data.
Finally, we see the number of malignant and benign subjects.
Next is exploratory data analysis.
This is a clustered heat map. We can observe two main things: first, the area, perimeter, and radius are highly correlated; second, compactness, concave points, and concavity are related to each other.
Since there are too many features, we will look at separate subsets for each case.
The SE features can be removed => when working with the coefficients.
Histogram => y-axis has no meaning.
Here, we see plots of each mean and worst feature. The diagonal shows histograms; the others are scatter plots showing the relation between two different features.
Intuitively: Tumor burden & malignancy => area is real area found in image
Next is a heat map of compactness, concavity, and concave points. For the slots with values greater than 0.9, we will visualize the correlation of the three features.
Here, we have graphed the points against each other. The Pearson r value is the correlation value; we can see that it ranges from 0.86 to 0.92, so the features have a high correlation. The 0.92 value between concavity and concave points arises because they are defined in relation to one another.
Next we will look at the distribution of the features, focusing on the worst features.
Here we can see that some features show a clear distinction between classes, like those in red, while others, like those in the blue box, show less separation.
Now we will look at feature engineering.
Next we standardize the data using the standard scaler; this unifies the units.
Z-score transformation.
Units are matched => so the model is not biased toward any feature.
We first delete the outliers detected in the first step.
Next is the modeling step. We are going to use 8 different models: ANN, SVC, decision tree, AdaBoost, random forest, extra trees, gradient boosting, and logistic regression.
Since our data does not have separate testing and training sets, we need to separate them. We do this with a 70:30 ratio.
Now for the ANN modeling. We optimized the parameters as seen here: there are 4 hidden layers, the learning rate is 0.001, and the number of epochs is 200. After trials, I found these to give the best results.
In order to find models to compare with, I cross-validated 7 different models.
The cross-validation score indicates how well the model will handle unseen data.
Here we used stratified k-fold cross-validation, which ensures that each fold has approximately the same distribution of target classes as the full dataset.
I took the top 4 models: SVC, extra trees, gradient boosting classifier, and logistic regression.
We can see that the SVC classifier has the highest values. Some potential reasons: due to its working mechanism, SVC is effective for data with clear class separation, and the dataset may be relatively small for the ANN to achieve the best results.
Finally, we will look at explainable AI methods, in particular SHAP and permutation feature importance.
This is the force plot with samples ordered by similarity. The red features contribute to classifying as malignant, and the blue features contribute to classifying as benign.
We will look at two different examples. The one on the top is about number
Medical interpretation => why area_worst and radius_worst are important in determining whether a tumor is malignant or benign.
Overall wrap-up: 1-2 points on how it went + what was not covered here + future work: points for improvement + what could be done next.