This document proposes a method for classifying breast cancer cells using an unsupervised linear transformation (PCA) together with cosine similarity. It involves three steps: (1) applying PCA to select robust features from a breast cancer dataset, (2) projecting the data into a lower-dimensional space using the selected features, and (3) classifying the cells as normal, benign, or malignant using cosine similarity in the reduced space. Experiments show accuracy increases from 78.9% without PCA to 99.12% with the proposed PCA-cosine similarity method, demonstrating its effectiveness for breast cancer classification.
1. Breast Cancer Classification Based on Unsupervised Linear Transformation along with Cosine Similarity
Machine Learning
Dr. Ashwan A. Abdulmunem
8/2/2021
3. Introduction
- Breast cancer is one of the leading causes of mortality in women. Early detection and treatment are
imperative for improving survival rates.
- According to a recent report published by the American Cancer Society, breast cancer is the most prevalent
form of cancer in women in the USA. In 2017 alone, approximately 252,000 new cases of invasive breast cancer
and 63,000 cases of in situ breast cancer were expected to be diagnosed, with 40,000 breast cancer-related
deaths expected to occur [1]. Consequently, there is a real need for early diagnosis and treatment, in order to
reduce morbidity rates and improve patients' quality of life.
[1] DeSantis, C.E., Ma, J., Goding Sauer, A., Newman, L.A., Jemal, A.: Breast cancer statistics, 2017, racial
disparity in mortality by state. CA: A Cancer Journal for Clinicians 67(6) (2017) 439–448
5. Breast Cancer: General Classification Approaches
● Grade. Grading focuses on the appearance of the breast cancer cells compared to the appearance of normal
breast tissue. Normal cells in an organ like the breast become differentiated, meaning that they take on specific
shapes and forms that reflect their function as part of that organ. Pathologists describe cells as well differentiated
(low-grade), moderately differentiated (intermediate-grade), and poorly differentiated (high-grade) as the cells
progressively lose the features seen in normal breast cells.
● Stage. The TNM classification for staging breast cancer is based on the size of the cancer at the site where it
originally started in the body and the locations to which it has spread.
TNM stands for:
T – tumour
N – node
M – metastasis
● DNA-based classification. Understanding the specific details of a particular breast cancer may include looking
at the cancer cell DNA by several different laboratory approaches. When specific DNA mutations or gene
expression profiles are identified in the cancer cells this may guide the selection of treatments, either by targeting
these changes, or by predicting from these alterations which non-targeted therapies are most effective.
7. Proposed Method: Abstract
- Detection and classification of breast cancer at the cellular level is one of the most
challenging problems. Since the morphology and other cellular features of cancer
cells are different from normal healthy cells, it is possible to classify cancer cells
and normal cells using such features.
- The classical methods of segmentation and classification for malignant cells are not
only repetitive but also very time-consuming [2].
- Using PCA to select robust and informative features
[2] Khan, S.U., Islam, N., Jan, Z. et al.: A machine learning-based approach for the segmentation and classification of malignant
cells in breast cytology images using gray level co-occurrence matrix (GLCM) and support vector machine (SVM). Neural
Comput & Applic (2020). https://doi.org/10.1007/s00521-021-05697-1
11. Breast Cancer Dataset
o Number of instances: 569
o ID number of the patient
o Diagnosis (M = malignant, B = benign)
o 30 features, derived from ten real-valued measurements:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for
each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
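The counts above match the Wisconsin Diagnostic Breast Cancer dataset bundled with scikit-learn. The slides do not name a specific loader, so using scikit-learn here is an assumption; a minimal sketch:

```python
# Sketch: loading the 569-instance, 30-feature breast cancer dataset
# via scikit-learn's bundled copy (an assumption; the slides don't
# specify how the data is loaded).
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target        # in this copy: 0 = malignant, 1 = benign

print(X.shape)                       # (569, 30)
print(data.feature_names[:3])        # mean radius, mean texture, mean perimeter
```

Note that this copy omits the patient ID column and encodes the diagnosis as 0/1 rather than M/B.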
14. Unsupervised Linear Transformation or Dimensionality Reduction (PCA)
We propose to use a combination of PCA and cosine similarity, named the PCA-Cos algorithm, to find the best
features of the cancer dataset. Principal Component Analysis (PCA) is well known for dimensionality reduction
and statistical measurement when manipulating big data.
15. PCA (cont.)
Sometimes we need to "compress" our data to speed up algorithms or to visualize it. One way is
dimensionality reduction: the process of reducing the number of random variables under
consideration by obtaining a set of principal variables.
Two approaches:
Feature selection: find a subset of the input variables.
Feature projection (also Feature extraction): transforms the data in the high-dimensional space to a space
of fewer dimensions. PCA is one of the methods following this approach.
16. PCA (cont.)
What does PCA do, mathematically (precisely)? We need to know about:
• Mean: the most balanced (central) point of the data.
• Variance: measures the spread of the data around the mean.
• Covariance: indicates the direction in which the data are spreading.
17. PCA Algorithm
1. Subtract the mean to center the data at the origin.
2. From the original data (many features x1, x2, …, xN), construct the covariance matrix U.
3. Find the eigenvalues λ1, λ2, … and corresponding eigenvectors v1, v2, … of that matrix. Choose the
K < N pairs (λ, v) with the highest eigenvalues, giving a reduced projection matrix.
4. Project the original data points onto the K-dimensional subspace spanned by these eigenvectors.
This step creates new data points in a new K-dimensional space.
5. Now, instead of solving the original problem (N features), we only need to solve a new problem
with K features (K < N).
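The five steps above can be sketched directly in numpy (a minimal illustration; the function name and the choice of K are ours, not the slides'):

```python
import numpy as np

def pca(X, K):
    """PCA via the eigendecomposition steps above (minimal numpy sketch)."""
    # Step 1: subtract the mean to center the data at the origin.
    Xc = X - X.mean(axis=0)
    # Step 2: construct the covariance matrix from the centered data.
    cov = np.cov(Xc, rowvar=False)
    # Step 3: eigenvalues/eigenvectors; keep the K with largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalue order
    order = np.argsort(eigvals)[::-1][:K]
    W = eigvecs[:, order]                       # (N, K) projection matrix
    # Step 4: project the data onto the K-dimensional subspace.
    return Xc @ W                               # step 5: work with K features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, K=2)
print(Z.shape)  # (100, 2)
```

The first projected column carries the largest variance, the second the next largest, mirroring the eigenvalue ordering in step 3.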
20. Cosine Similarity
• A measure of similarity between two non-zero vectors of an inner
product space
• The cosine of the angle between the two vectors
• The inner product of the two vectors after normalizing each to length 1
• Not a measure of vector magnitude, only of the angle between vectors
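That definition translates directly into a few lines of numpy (the helper name `cosine_similarity` is ours):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between non-zero vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal vectors)
print(cosine_similarity([1, 2], [2, 4]))   # ≈ 1.0 (same direction)
```

Note the magnitude-invariance: [1, 2] and [2, 4] point the same way, so their similarity is 1 even though their lengths differ.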
24. Conclusion
◼ Based on the experiments we can conclude that cosine similarity can work
effectively along with the PCA algorithm. By using this combination, the results
clearly improved: the accuracy without PCA is 78.9%, with about 24 false-negative
values among the testing instances, while with PCA the accuracy increased to
99.12%, giving findings acceptable enough to justify the combination. As a result,
machine learning with effective feature selection gives a reliable outcome for a
vital problem in the health community.
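The slides do not spell out the exact classification rule, so the following is only a plausible end-to-end sketch of the pipeline: standardize the features, reduce with PCA (K = 10 is our assumption), and assign each test sample to the class whose training centroid is most cosine-similar in the reduced space. It is not the authors' exact method and will not necessarily reproduce the reported 99.12%.

```python
# Plausible PCA + cosine-similarity pipeline sketch (assumptions: standard
# scaling, K = 10 components, nearest-centroid-by-cosine classification).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

scaler = StandardScaler().fit(X_tr)
pca = PCA(n_components=10).fit(scaler.transform(X_tr))
Z_tr = pca.transform(scaler.transform(X_tr))
Z_te = pca.transform(scaler.transform(X_te))

# One centroid per class in the reduced space.
centroids = np.stack([Z_tr[y_tr == c].mean(axis=0) for c in (0, 1)])

def cos_sim(a, B):
    """Cosine similarity of vector a against each row of B."""
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a))

pred = np.array([np.argmax(cos_sim(z, centroids)) for z in Z_te])
accuracy = (pred == y_te).mean()
print(f"accuracy: {accuracy:.3f}")
```

The exact accuracy depends on the split, K, and the classification rule, which is why the reported figures cannot be verified from the slides alone.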