This document describes using machine learning models to classify breast cancer tumors as benign or malignant from cell nucleus characteristics, without requiring a biopsy. It covers loading and preprocessing the Wisconsin Breast Cancer dataset, performing exploratory data analysis to identify important features, engineering features, training classifiers including an SVC, and evaluating the models. SHAP and permutation feature importance analysis identified concave point characteristics as the most important for classification. The top-performing SVC classifier achieved over 99% accuracy, suggesting diagnosis may be possible without biopsy. Future work could apply these methods to other cancers where biopsy is difficult.
6. 1. Introduction
- The problem
The 10 leading types of estimated new cancer cases and deaths in 2020. (South Korea)
(A) Estimated new cases (B) Estimated Deaths
7. 1. Introduction
- The problem
Cancer ≠ Tumor: a tumor is an abnormal growth of cells causing a mass of tissue.
• Malignant tumors are cancerous and invade other sites.
• Benign tumors stay in their primary location.
9. 1. Introduction
- The problem
Benign:
• Uniform nucleus size
• Symmetrical
• Homogeneous
• Areas within normal size
Malignant:
• Non-uniform nucleus
• Asymmetrical
• Non-homogeneous sizes
• Areas above normal size
10. 1. Introduction
- The problem: Diagnosis
Problems with biopsy:
• Depending on type, painful to the patient
• Potential side effects (e.g., bruising)
• Diagnosis can take time
• Tedious process
[Diagram: imaging features → machine learning model → malignant / benign]
11. 1. Wisconsin Breast Cancer dataset
- What?
• Contains parameters measured from a fine needle aspirate of a breast mass.
• The parameters describe the cell nuclei (569 samples).
[Image: cell nuclei]
12. 1. Wisconsin Breast Cancer dataset
- What?
• The parameters include the following 10 features
•Radius
•Texture
•Perimeter
•Area
•Smoothness
•Compactness (perimeter^2 / area - 1.0)
•Concavity (severity of concave portions of the contour)
•Concave points (number of concave portions of the contour)
•Symmetry
•Fractal dimension
13. 1. Wisconsin Breast Cancer dataset
- What?
• The parameters include the following 10 features
•Texture: standard deviation of gray-scale values
•Smoothness: local variation in radius lengths
•Compactness: (perimeter^2 / area - 1.0)
•Concavity: severity of concave portions of the contour
•Concave points: Number of concave portions of the contour
•Symmetry: Uses nucleus deformation parameter to measure
how non-spherical a nucleus is.
• Fractal dimension: ("coastline approximation" - 1)
[Diagram: nucleus with radius, perimeter, and area annotated]
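The compactness formula on this slide can be checked with a quick computation. For a perfect circle, perimeter² / area − 1.0 reduces to 4π − 1, the minimum possible value, so larger compactness means a less circular, more irregular nucleus outline. The numbers below are illustrative, not from the dataset:

```python
import math

# Compactness as defined on the slide: perimeter^2 / area - 1.0
def compactness(perimeter: float, area: float) -> float:
    return perimeter ** 2 / area - 1.0

r = 3.0
# A perfect circle: perimeter = 2*pi*r, area = pi*r^2 -> 4*pi - 1 (~11.57)
circle = compactness(2 * math.pi * r, math.pi * r ** 2)
# An irregular outline with a longer perimeter for similar area scores higher
irregular = compactness(25.0, 20.0)
```

This is why compactness (like concavity) acts as a shape-irregularity score for the nucleus contour.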
14. 1. Wisconsin Breast Cancer dataset
- What? : Structure of the dataset
• Each of the 10 features is recorded three ways: mean, standard error, and worst → 30 feature columns.
• Plus ID, diagnosis, and an empty Unnamed column → 33 columns in total.
15. 1. Wisconsin Breast Cancer dataset
- Importance?
• Doctors would be able to determine whether a tumor is malignant or benign through imaging, without biopsy.
• Breakthroughs with breast cancer can act as a stepping stone for other cancers where biopsy is difficult to conduct.
[Diagram: model → malignant / benign]
19. 1. Loading and checking the data
ID & Diagnosis
Mean
Standard error
Worst
Unnamed
20. 1. Loading and checking the data
- 2: Checking for null values
• The Unnamed column consists entirely of null values.
• We will thus remove this column later.
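A minimal sketch of this null check in pandas. The tiny DataFrame below stands in for the real CSV; its column names mimic the Wisconsin Breast Cancer file, including the trailing empty "Unnamed: 32" column:

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv("data.csv"); values here are illustrative.
df = pd.DataFrame({
    "id": [8510426, 8510653, 8510824],
    "diagnosis": ["B", "B", "M"],
    "radius_mean": [13.54, 13.08, 17.99],
    "Unnamed: 32": [np.nan, np.nan, np.nan],  # trailing empty column in the CSV
})

null_counts = df.isnull().sum()                         # nulls per column
all_null = null_counts[null_counts == len(df)].index    # columns that are entirely null
df = df.drop(columns=all_null)                          # drop them
```

After the drop, only the ID, diagnosis, and feature columns remain.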
21. 1. Loading and checking the data
- 3: Outlier Detection
• Redefine "X" to include only the features.
• Output: outliers found based on the feature values alone (x_col).
• We will drop these rows later.
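The deck does not state which outlier rule was used, so the sketch below assumes the common 1.5×IQR rule applied per feature column; the data values are made up for illustration:

```python
import pandas as pd

# Illustrative feature column; 28.1 is the planted outlier.
X = pd.DataFrame({"radius_mean": [12.0, 13.1, 12.8, 13.4, 28.1, 12.5]})

def iqr_outliers(col: pd.Series) -> pd.Series:
    """Boolean mask marking values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

mask = iqr_outliers(X["radius_mean"])
outlier_rows = X.index[mask].tolist()   # row indices to drop later
```

The flagged row indices would then be dropped from both X and the diagnosis labels in the feature-engineering step.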
22. 1. Loading and checking the data
- 4: Summary and statistics
• We can observe the statistical values for each feature.
• Redefine "X" to include only the features.
• Output
23. 1. Loading and checking the data
- 4: Summary and statistics
• We can observe the number of benign and malignant tumors.
• Redefine "data_w_diag" to include the diagnosis and the 30 features.
• Output
Number of Benign: 357
Number of Malignant: 212
24. 2. Exploratory Data Analysis
1: Heat Map all features
2: Important features
2-1: Radius VS Perimeter VS Area
1: Heat map
2-2: Compactness VS Concavity VS Concave points
1: Heat map
2: Feature plotting: Histogram
3: Overall data distribution
25. 2. EDA
- 1: Heat Map
The heat map reveals a couple of relationships:
1. Within the mean and worst features, radius is highly correlated with perimeter and area.
2. Compactness, concavity, and concave points are correlated with each other.
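The correlations behind the heat map can be reproduced numerically; as a convenience, the sketch below uses scikit-learn's bundled copy of the same Wisconsin dataset (feature names there are spelled like "mean radius" rather than "radius_mean"):

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset as a DataFrame and compute the feature correlation matrix,
# i.e. the numbers a heat map such as seaborn's would visualize.
data = load_breast_cancer(as_frame=True)
X = data.frame.drop(columns=["target"])

corr = X.corr()
r_radius_perim = corr.loc["mean radius", "mean perimeter"]        # ~1.0
r_compact_concave = corr.loc["mean compactness", "mean concave points"]
```

Both relationships from the slide show up: radius vs perimeter is nearly perfectly correlated, and compactness vs concave points is strongly correlated.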
26. 2. EDA
- 2-1: Radius VS Perimeter VS Area
1. Heat map
• Cells showing 1.0 in different colors: the displayed values are rounded, so the underlying correlations differ slightly.
27. 2. EDA
-2-2: Compactness VS Concave points VS Concavity
1. Heat map
We will compare the following features:
• Compactness & concavity & concave points (high correlation)
• Compactness mean VS compactness worst
• Concavity mean VS concavity worst
• Concave points mean VS concave points worst
28. 2. EDA
-2-2: Compactness VS Concave points VS Concavity
2. Feature plotting: Worst VS Mean (concave points, concavity, compactness)
• Concavity & compactness: worst ≒ mean (similar overall distribution)
• Concave points: worst ≠ mean
29. 2. EDA
- 3: Data distribution
• Violin Plot
• Worst
• Mean
• Standard Error
30. 2. EDA
- 3: Data distribution
• Violin Plot
• Worst
• Red box: examples of features with good separation
• Blue box: examples of features with bad separation
Assumption: features with good separation will have higher feature importance.
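The visual notion of "good separation" in the violin plots can be given a rough numeric stand-in: the standardized difference between class means for each feature. This is an illustrative metric, not the one used in the deck:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target           # target: 0 = malignant, 1 = benign

# |mean(malignant) - mean(benign)| in units of each feature's spread
sep = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)) / X.std(axis=0)

# Features ranked from best to worst separation
ranked = [data.feature_names[i] for i in np.argsort(sep)[::-1]]
```

Consistent with the assumption on the slide, concave-point features rank near the top by this measure, foreshadowing the SHAP results later.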
34. 4. Modeling
1: Splitting data
2: Classification
2-1: ANN
2-2: SVC vs Decision Tree vs Ada Boost vs Random forest vs Extra trees vs GBC vs Logistic regression
3: Cross validate models
4: Hyperparameter tuning
5: Evaluating models
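The split-scale-fit sequence outlined above can be sketched with scikit-learn. The 70/30 split follows the speaker notes and the RBF kernel is mentioned later in the deck; the random seed and other settings are assumptions for reproducibility:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 70/30 stratified split so both sets keep the benign/malignant ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Standardize inside a pipeline so the scaler is fit on training data only
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)   # held-out accuracy
```

Wrapping the scaler and classifier in one pipeline avoids leaking test-set statistics into the standardization step.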
39. Model comparison

Model                         Accuracy  Precision  Recall  F-1 Score
ANN                           0.978     0.950      0.974   0.962
SVC classifier                0.992     0.974      1.000   0.987
Extra Trees classifier        0.978     0.974      0.950   0.962
Gradient boosting classifier  0.957     0.923      0.923   0.923
Logistic regression           0.985     0.974      0.974   0.974
4. Modeling
-5: Evaluating models
• Sensitivity (recall) is the most important metric: higher recall means fewer cancer patients misclassified as negative.
• SVC has the highest value for all metrics. SVC is effective for data with clear class separation; the dataset may be too small for the ANN to achieve the best results.
• All models score above 0.9, suggesting that some features allow for good classification.
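The four metrics in the table are standard scikit-learn scores. The tiny label vectors below are made up for illustration, with 1 treated as the malignant (positive) class:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # illustrative ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # illustrative model predictions

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted malignant, how many truly are
rec = recall_score(y_true, y_pred)      # of true malignant, how many were found
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```

Recall is the metric the slide singles out, since a false negative (a missed malignant tumor) is the costliest error here.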
40. 5. Explainable AI
1: SHAP
1-1: Summary plot
2: Permutation Feature Importance
3: Comparison (SHAP VS PFI)
42. 5. Explainable AI
- 1: SHAP
3. Summary plot
• concave points_worst has the highest importance.
• Features that had high correlation in the EDA:
  • Concave points vs concavity vs compactness: concave points ≠ concavity ≠ compactness
  • Area vs perimeter vs radius:
    area_worst ≒ radius_worst ≠ perimeter_worst
    area_mean ≠ radius_mean ≒ perimeter_mean
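To illustrate what the SHAP summary plot aggregates, here is a hand-rolled computation of exact Shapley values for a tiny 3-feature linear model: a feature's value is its marginal contribution averaged over all feature orderings, with "absent" features replaced by their background mean. The model, weights, and data here are all illustrative, not the deck's SVC:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))    # background data defining the baseline
mu = background.mean(axis=0)

w = np.array([2.0, -1.0, 0.5])
def f(z):                                # the "model": a simple linear function
    return float(z @ w)

x = np.array([1.0, 0.5, -2.0])           # the instance to explain

def shapley(i, n=3):
    total = 0.0
    for perm in itertools.permutations(range(n)):
        present = set(perm[:perm.index(i)])  # features revealed before i
        z = np.array([x[j] if j in present else mu[j] for j in range(n)])
        z_with = z.copy()
        z_with[i] = x[i]
        total += f(z_with) - f(z)            # marginal contribution of i
    return total / math.factorial(n)

phi = np.array([shapley(i) for i in range(3)])
# Additivity: the values sum to f(x) minus the baseline prediction f(mu)
```

In practice the shap library approximates this for models like an RBF SVC, where the exact enumeration would be intractable; the summary plot then shows the distribution of these per-sample values for each feature.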
44. 5. Explainable AI
- 3: Comparing Output
SHAP VS PFI
• Interactions: PFI measures the decrease in model performance when a feature is permuted, while SHAP values account for interactions between features and capture non-linear relationships.
  For complex non-linear models, SHAP may provide a better ranking.
  o We used 'rbf' as the kernel function. SHAP can handle the non-linearities; PFI may not fully capture them.
  o Some features showed high interactions; PFI does not consider this, which may cause differences in importance.
• Distribution differences: SHAP considers the distribution of the entire dataset, while PFI focuses on the effect of permuting a feature individually.
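PFI as described above is available directly in scikit-learn: permute one feature's column, re-score, and record the drop. Fitting and scoring on the full data keeps this sketch short; the deck's analysis would use a held-out split, and the settings below are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(data.data, data.target)

# For each feature: shuffle its column n_repeats times and average the
# resulting drop in accuracy.
result = permutation_importance(model, data.data, data.target,
                                n_repeats=5, random_state=0)

order = np.argsort(result.importances_mean)[::-1]
top5 = [data.feature_names[i] for i in order[:5]]   # most important features
```

Because each feature is shuffled on its own, correlated features (radius, perimeter, area) can "cover" for one another and each look less important than SHAP suggests, which is exactly the discrepancy the slide discusses.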
45. Wrap-up
• We can determine whether a tumor is benign or malignant using an SVC classifier with high accuracy.
Implication: no biopsy would be needed for diagnosis.
46. Future work
• Methods to scale the imaging pipeline would be needed.
• There are other cancers, such as kidney cancer, where biopsy is much more difficult. The breast cancer model could serve as a stepping stone for advancements in those areas.
  o No biopsy would be needed, and diagnosis would be less tedious and painful for patients.
  o Human labor can be reduced.
Editor's Notes
But actually, many other celebrities went through this process after finding out they had the mutated version of the gene.
So, just by looking at this, we can see that breast cancer is very common.
This can be proved statistically by looking at the following diagram.
https://doi.org/10.4143/crt.2020.203
This is a diagram that shows the types of cancer that females were diagnosed with in South Korea in 2020.
We can see that 24.7% is breast cancer. This means that since there are about 10 people in this classroom, assuming we are all female, 2-3 people will get breast cancer.
Now let's look at the mortality rate of each cancer. We can see that although breast cancer was first for diagnoses, it was not for deaths. Then why?
Although this may be due to the fact that diagnosis of breast cancer is easier than for instance lung cancer, it is also because once diagnosed, the treatment is quite well developed for some types.
Thus we can see two things here. First is that breast cancer is common in women, and second, that diagnosis is essential when treating the disease.
Then how are these tumors diagnosed?
How is a tumor diagnosed as cancerous or non-cancerous? In most cases, tumors are self-diagnosed or found through annual checkups, because breast tumors can be self-diagnosed by hand.
Thus, the important part of diagnosis is checking whether the tumor is benign or malignant. As you can see, benign tumors stay in their primary location, whereas malignant tumors invade other areas.
When explaining the dataset, also explain the features.
=> Wisconsin website
This process is done traditionally through biopsy.
We first take a piece of the tumor tissue in the breast and diagnose by looking at the sample under the microscope.
Breast cancer diagnosis
Development process => benign-to-malignant graphical representation
https://www.mypathologyreport.ca/ko/pathology-dictionary/biopsy/
https://doi.org/10.1016/j.jmoldx.2021.01.006
This is an example that can be seen under the microscope.
As you can see, benign tumors are relatively round and symmetric in shape, whereas malignant tumors are irregular and have a large nucleus.
The diagnosis is done by eye due to these clear distinguishing features.
https://doi.org/10.1038/s41598-022-19278-2
Show with arrows in different colors (symm, normal,
So then what is the problem with this type of diagnosis?
Why do we need machine learning classification methods?
Well, this is because biopsy has a couple of drawbacks: it is painful to the patient and is a tedious process.
If we use machine learning, we will be able to diagnose malignant tumors just from features obtained through screening, thus without biopsy.
Breast cancer diagnosis: invasive method (biopsy)
Development process => benign-to-malignant graphical representation
https://www.mypathologyreport.ca/ko/pathology-dictionary/biopsy/
https://doi.org/10.1016/j.jmoldx.2021.01.006
Then what data do we use to train this model?
We are going to use the Wisconsin Breast Cancer dataset.
This has information about the nuclei of the cells, which is obtained by biopsy.
When explaining the dataset, also explain the features.
Wisconsin website
Add pg 15 immunohistochemistry => how was the data made?
Add images with malignant and benign tumors => can see how the labels of the dataset were made.
There are cancers for which biopsy is difficult => image-based diagnosis would be helpful.
IMPORTANCE => non-invasive method.
The data includes the following features. I will go through each one briefly.
Let's assume this circle is the nucleus of a cell.
The radius will be this part; the perimeter and the area will be here.
The texture is the standard deviation of gray-scale values. This scores the gray and white areas of the cell.
The smoothness values are local variations in radius lengths.
The compactness is calculated from the perimeter and the area.
The concavity is the severity of the concave portions of the contour.
The concave points value is the number of such points.
Symmetry values.
Finally, the fractal dimension is found with the coastline approximation.
Let's assume the shape on the left is a tumor nucleus. We see that the outside is uneven. Here we can measure the mean radius and also the worst radius, and then calculate the standard error using the equation above. Therefore, there are three measurements for each feature.
This type of measurement is recorded for all features.
As you can see here, since there are 10 distinct features, we end up with 30 different measurements. With the ID and the diagnosis we end up with 32 columns (33 including the empty Unnamed column).
Therefore, we will be able to diagnose cancer by obtaining the different morphological characteristics through imaging.
Now we will look through the machine learning process.
First, we will load and check the data.
As stated before, there is the ID, the diagnosis, the 30 features, and the Unnamed column.
We then check for null values. We see that the Unnamed column consists of only null values; we will remove it later.
Then comes outlier detection; we will drop the outliers later on.
We also check the summary and statistics of the numerical data.
Finally, we see the number of malignant and benign subjects.
Next is exploratory data analysis.
This is a clustered heat map. We can observe two main things: first, the area, perimeter, and radius are highly correlated; second, compactness, concave points, and concavity are related to each other.
Since there are too many features, we will look at separate subsets for each case.
The SE features can be removed => when working with the coefficients.
Histogram => y-axis has no meaning.
Here, we see plots of each mean and worst feature. The diagonal shows histograms; the others are scatter plots showing the relation between two different features.
Intuitively: Tumor burden & malignancy => area is real area found in image
Next is a heat map of compactness, concavity, and concave points. For the slots with values greater than 0.9, we will visualize the correlation of the three features.
Here, we have graphed the points against each other. The Pearson r value is the correlation value; we can see that it ranges from 0.86 to 0.92, so the features have a high correlation. The 0.92 value between concavity and concave points arises because they are defined in relation to one another.
Next we will look at the distribution of the features, focusing on the worst features.
Here we can see that some features show a clear distinction between classes, like those in red, while others, like those in the blue box, show less separation.
Now we will look at feature engineering.
Next we standardize the data using the standard scaler; this unifies the units.
Z-score transformation.
Units are matched => so the model is not biased toward any feature.
We first delete the outliers detected in the first step.
Next is the modeling step. We are going to use 8 different models: ANN, SVC, decision tree, AdaBoost, random forest, extra trees, gradient boosting, and logistic regression.
Since our data does not have separate testing and training sets, we need to separate them. We do this with a 70:30 ratio.
Now for the ANN modeling. We optimized the parameters as seen here: there are 4 hidden layers, the learning rate is 0.001, and the number of epochs is 200. After trials, I found these to give the best results.
In order to find models to compare with, I cross-validated 7 different models.
The cross-validation score indicates how well the model will handle unseen data.
Here we used stratified k-fold cross-validation, which ensures that each fold has approximately the same distribution of target classes as the full dataset.
I took the top 4 models: SVC, extra trees, gradient boosting classifier, and logistic regression.
We can see that the SVC classifier has the highest values. Some potential reasons: due to its working mechanism, SVC is effective for data with clear class separation, and the dataset may be relatively small for the ANN to achieve the best results.
Finally, we will look at explainable AI methods, in particular SHAP and permutation feature importance.
This is the force plot with samples ordered by similarity. The red features contribute to classifying as malignant, and the blue features contribute to classifying as benign.
We will look at two different examples. The one on the top is about number
Medical interpretation => why area_worst and radius_worst are important in determining whether a tumor is malignant or benign.
Overall wrap-up: 1-2 points on how it went + what was not covered here + future work: points for improvement + what could be done next.