Prediction for breast cancer using various machine learning algorithms

Project Batch Details:
Batch Information:
• LUCKY SHETTY [1KN20CS015]
• NAVEENA C K [1KN20CS026]
• VISHNU BABU B [1KN20CS050]
Project Guide: Prof. Kusum Rajput, Dept. of CSE

Abstract:
Breast cancer has overtaken lung cancer as the most common cancer among
women worldwide. A combined sampling method is used to address the
class imbalance in the samples, and the data are standardized to improve
their separability. The final results for each model are obtained using
10-fold cross-validation.
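
The standardization step mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming z-score scaling is applied per feature column; the function name `zscore_columns` and the sample values are hypothetical, and in practice a library routine such as scikit-learn's `StandardScaler` would normally be used.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical feature rows (two made-up features); not real patient data
samples = [[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]]
scaled = zscore_columns(samples)
```

After scaling, both columns are on the same scale even though the raw values differ by an order of magnitude, which matters for distance-based models such as KNN and SVM.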

Introduction:
• Breast cancer, as one of the common malignant tumors in women, has
become a focus of public health attention around the world.
• Machine learning, as an important artificial intelligence technology, has the
ability to extract features, discover patterns and build predictive models
from a large amount of medical data.
• For breast cancer diagnosis, the application of machine learning has
revolutionized the field and achieved remarkable results.

Literature Survey:
Each entry lists the reference, the datasets used, the machine learning algorithms, and the key findings.
• Reference: R. L. Siegel, K. D. Miller, N. S. Wagle, and A. Jemal, “Cancer statistics, 2023,” CA, Cancer J. Clinicians, vol. 73, no. 1, pp. 17–48, Jan. 2023.
  Datasets Used: Wisconsin Breast Cancer dataset
  Machine Learning Algorithms: Logistic Regression, SVM, Decision Trees
  Key Findings: Demonstrated the efficacy of SVM in classifying breast cancer based on clinical data, achieving high accuracy and sensitivity.
• Reference: M. S. Iqbal, W. Ahmad, R. Alizadehsani, S. Hussain, and R. Rehman, “Breast cancer dataset, classification and detection using deep learning,” Healthcare, vol. 10, no. 12, p. 2395, Nov. 2022.
  Datasets Used: Multi-Modal Data Integration
  Machine Learning Algorithms: Feature Selection, PCA, t-SNE
  Key Findings: Investigated the impact of integrating multi-modal data and highlighted the importance of feature selection for model interpretability.
• Reference: Z. Cai, R. C. Poulos, J. Liu, and Q. Zhong, “Machine learning for multiomics data integration in cancer,” iScience, vol. 25, no. 2, Feb. 2022.
  Datasets Used: Clinical Data, Imaging, Genetic Data
  Machine Learning Algorithms: XGBoost, Decision Trees
  Key Findings: Developed a hybrid model combining clinical, imaging, and genetic data, showing promising results in predicting breast cancer risk.
• Reference: D. K. Rakesh and P. K. Jana, “A general framework for class label specific mutual information feature selection method,” IEEE Trans. Inf. Theory, vol. 68, no. 12, pp. 7996–8014, Dec. 2022.
  Datasets Used: Standardized Datasets
  Machine Learning Algorithms: Random Forest, SVM
  Key Findings: Advocated for the use of standardized datasets to ensure consistency in model evaluation and compared the performance of different algorithms.
• Reference: N. Al Mudawi and A. Alazeb, “A model for predicting cervical cancer using machine learning algorithms,” Sensors, vol. 22, no. 11, p. 4132, May 2022.
  Datasets Used: TCGA, Clinical Data
  Machine Learning Algorithms: Logistic Regression, Ensemble Methods, SVM
  Key Findings: Explored the interpretability of models and discussed the trade-offs between accuracy and interpretability in the context of breast cancer prediction.
• Reference: W. Xing and Y. Bei, “Medical health big data classification based on KNN classification algorithm,” IEEE Access, vol. 8, pp. 28808–28819, 2020.
  Datasets Used: Imaging Data
  Machine Learning Algorithms: CNNs, Feature Extraction (t-SNE)
  Key Findings: Focused on the role of deep learning in analyzing mammographic images, highlighting the significance of feature extraction methods such as t-SNE.

System Architecture:
[Flowchart] Datasets → Data Preprocessing → Feature Selection → Model Training → Threshold >= 90% (Yes/No) → Grid Search method → Cross validation → Best Model (Yes/No) → Contrast analysis

Data Flow Diagram:
[Diagram] Data → Training Data and Testing Data → Process Data → Feature extraction (for each branch) → WDBC → Classification → Result

Sequence Diagram:
Participants: User, System
1. Import modules
2. Load dataset
3. Display dataset
4. Explore data
5. SVM
6. Random Forest
7. Decision tree
8. Logistic regression
9. Classify
10. Result

Use Case Diagram:
• Data preprocessing
• Data Preparation
• Feature Projection
• Feature Selection
• Feature Scaling
• Model Selection
• Prediction
• Result

Hardware and Software Requirements:
Hardware requirements:
• Processor (CPU): Intel (e.g., Core i7, Xeon) or AMD (e.g., Ryzen, EPYC).
• Graphics Processing Unit (GPU): NVIDIA GPUs (e.g., GeForce, Quadro, Tesla).
• Random Access Memory (RAM): At least 16 GB of RAM is recommended.
• Storage: SSDs are preferred over HDDs for faster data access.
• Internet Connection: A stable internet connection is required for downloading
datasets, libraries, and updates during the development process.

Software Requirements:
• Operating System: Linux, Windows or macOS.
• Programming language: Python.
• Integrated Development Environment (IDE): Jupyter Notebooks, VSCode,
PyCharm, and others.
• Machine Learning Libraries and Frameworks: Install libraries such as scikit-learn,
TensorFlow, PyTorch, and Keras.
• Data Manipulation and Analysis: Pandas is widely used.
• Data Visualization: Matplotlib and Seaborn are common libraries.

Proposed system:
[Flowchart] Raw data → Data preprocessing (SMOTE-ENN combination sampling, Z-score standardization) → Feature selection (Mutual information, SHAP feature explanation) → Model training (KNN, SVM, RF, LR) → Grid search method → Cross validation → Best model (Yes/No) → Contrast analysis
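
The grid search and cross-validation stages of this pipeline can be illustrated with a small sketch. It is written in plain Python under stated assumptions: a toy KNN classifier stands in for the trained models, the only hyperparameter searched is the neighbor count `k`, and the data are made-up, well-separated points rather than the project's dataset. All function names are hypothetical; with scikit-learn, `GridSearchCV` and `KFold` cover the same ground.

```python
import math
from collections import Counter


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; fold i holds out every n_folds-th sample."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def cross_val_accuracy(X, y, k, n_folds):
    """Fraction of samples classified correctly when each is held out once."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
    return correct / len(X)


def grid_search(X, y, k_values, n_folds):
    """Score every candidate k by cross-validation and return the best one."""
    scores = {k: cross_val_accuracy(X, y, k, n_folds) for k in k_values}
    best_k = max(scores, key=scores.get)  # ties resolve to the first candidate
    return best_k, scores


# Made-up data: two well-separated clusters of 10 points each
X = [[i * 0.1, 0.0] for i in range(10)] + [[5.0 + i * 0.1, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10
best_k, scores = grid_search(X, y, k_values=[1, 3, 5], n_folds=10)
```

The folds here are interleaved without shuffling to keep the sketch deterministic; a real evaluation would typically shuffle and stratify the folds so that each one preserves the class ratio.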

Logistic regression:
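• Linear model used for binary classification; a weighted sum of the features is passed through the logistic (sigmoid) function to estimate a class probability.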
• Suitable for predicting breast cancer risk based on multiple features.
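
As a rough illustration of how logistic regression turns features into a prediction, the sketch below computes a weighted sum, applies the sigmoid function, and thresholds the resulting probability. The weights, bias, and feature values are hypothetical placeholders, not fitted parameters from this project.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Estimated probability of the positive class."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Class label: 1 if the probability reaches the threshold, else 0."""
    return int(predict_proba(weights, bias, features) >= threshold)


# Hypothetical parameters for two standardized features
weights, bias = [1.5, -0.5], -1.0
probability = predict_proba(weights, bias, [2.0, 1.0])
label = predict(weights, bias, [2.0, 1.0])
```

Lowering `threshold` below 0.5 trades more false positives for fewer missed positive cases, which is a common consideration in screening settings.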
Decision Trees:
• Non-linear model that uses a tree-like structure for classification.
• Can handle both categorical and continuous features.
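
A decision tree is built by repeatedly choosing the split that best separates the classes. The sketch below shows one such step for a single continuous feature, assuming Gini impurity as the split criterion (one common choice; the slides do not state which criterion is used). The feature values and the "B"/"M" labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, gini(labels))  # no split at all is the baseline
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2  # midpoint between neighboring values
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up feature values with benign ("B") and malignant ("M") labels
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["B", "B", "B", "M", "M", "M"]
threshold, impurity = best_threshold(values, labels)
```

A full tree applies this search to every feature at every node and recurses on the two resulting subsets until a stopping rule is met.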

Random Forests:
• Ensemble learning method that combines multiple decision trees.
• Reduces overfitting and improves accuracy.
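
The ensemble idea can be sketched with bootstrap sampling and majority voting. To stay short, this illustration replaces full decision trees with one-split "stumps" that each look at a single randomly chosen feature, so it is a simplified stand-in rather than a complete random forest. It assumes binary labels; all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """One-feature rule: split at the midpoint between the two class means."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row[feature])
    if len(by_class) == 1:  # the bootstrap sample contained a single class
        only = next(iter(by_class))
        return lambda row: only
    (a, a_vals), (b, b_vals) = by_class.items()  # assumes exactly two classes
    mean_a, mean_b = sum(a_vals) / len(a_vals), sum(b_vals) / len(b_vals)
    threshold = (mean_a + mean_b) / 2
    low, high = (a, b) if mean_a <= mean_b else (b, a)
    return lambda row: low if row[feature] <= threshold else high


def fit_forest(X, y, n_trees, seed=0):
    """Train each stump on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], feature))
    return forest


def forest_predict(forest, row):
    """Majority vote over all members of the ensemble."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


# Made-up data: two clusters that differ on both features
X = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25)
```

Because each member sees a different resample and feature, their individual errors tend to differ, and the vote averages them out; this is the mechanism behind the reduced overfitting noted above.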
Support Vector Machines:
• Uses hyperplanes to separate data into different classes.
• Effective for high-dimensional feature spaces.
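
For a linear SVM, the separating hyperplane is defined by a weight vector `w` and an offset `b`, and a point is classified by the sign of `w · x + b`. The sketch below shows that decision rule, the distance of a point from the hyperplane, and the hinge loss that SVM training minimizes. The values of `w` and `b` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b; its sign gives the predicted side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane w . x + b = 0."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


def hinge_loss(w, b, x, y):
    """Zero when x is on the correct side with margin at least 1 (y is +1 or -1)."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
w, b = [3.0, 4.0], -5.0
side = classify(w, b, [3.0, 4.0])
distance = distance_to_hyperplane(w, b, [3.0, 4.0])
```

Non-linear SVMs follow the same rule after mapping the data through a kernel, which is what lets them separate classes that are not linearly separable in the original features.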

Advantages of Proposed System:
• Early Detection
• Risk Assessment
• Personalized Treatment Plans
• Improved Accuracy and Consistency
• Resource Optimization

Existing system:
[Flowchart] Raw data → Data preprocessing (SMOTE-ENN combination sampling, Z-score standardization) → Feature selection (Mutual information, Recursive feature elimination, SHAP feature explanation) → Model training (KNN, SVM, RF, LR, XGBoost) → Grid search method → Cross validation → Best model (Yes/No) → Contrast analysis
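
The mutual information step in feature selection scores each feature by how much it reveals about the class label. The sketch below computes mutual information for discrete values only; it is an illustration with made-up data, and continuous features such as tumor measurements would first need binning or an estimator like scikit-learn's `mutual_info_classif`.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """I(X; Y) in bits for two equally long sequences of discrete values."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
    return sum(
        (count / n) * math.log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


# Made-up example: one feature tracks the label exactly, the other is unrelated
labels = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]
unrelated = ["a", "b", "a", "b"]

scores = {
    "informative": mutual_information(informative, labels),
    "unrelated": mutual_information(unrelated, labels),
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Features are then kept or dropped according to their rank, so a feature with a score near zero is a candidate for removal.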

XGBoost:
• XGBoost is a scalable and accurate machine learning algorithm that falls under the
category of gradient boosting frameworks.
• It is an optimized implementation of gradient boosting machines and is widely
used for building predictive models.
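
The core idea of gradient boosting, which XGBoost builds on, is to add weak learners one at a time, each fitted to the errors left by the ensemble so far. The sketch below is a generic, minimal version for squared loss on a single feature; it is not XGBoost's implementation, which adds regularization, second-order gradient information, and many engineering optimizations. Function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor on one feature (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    points = sorted(set(xs))
    for low, high in zip(points, points[1:]):
        t = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def fit_boosted(xs, ys, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Made-up data: the target switches from 0 to 1 between x = 2 and x = 3
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 1.0, 1.0]
model = fit_boosted(xs, ys, n_rounds=10, learning_rate=0.5)
```

The learning rate shrinks each stump's contribution, so the prediction approaches the targets gradually over the rounds instead of jumping there in one step.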
Logistic regression:
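• Linear model used for binary classification; a weighted sum of the features is passed through the logistic (sigmoid) function to estimate a class probability.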
• Suitable for predicting breast cancer risk based on multiple features.

Decision Trees:
• Non-linear model that uses a tree-like structure for classification.
• Can handle both categorical and continuous features.
Random Forests:
• Ensemble learning method that combines multiple decision trees.
• Reduces overfitting and improves accuracy.
Support Vector Machines:
• Uses hyperplanes to separate data into different classes.
• Effective for high-dimensional feature spaces.

Drawbacks:
• Limited Generalizability: A high accuracy rate on a specific training dataset does
not guarantee similar performance on different datasets or in diverse clinical
settings.
• Lack of Contextual Understanding: Machine Learning algorithms might struggle
with understanding the contextual nuances of medical reports, including sarcasm,
idiomatic expressions, or ambiguous language.
• Inadequate Handling of Medical Jargon: Medical reports often contain complex
terminology and abbreviations.

• Limited Adaptability to Varied Data Sources: Healthcare data comes in
diverse formats, including text, images, and numerical data.
• Sensitivity to Preprocessing Techniques: The accuracy of Machine
Learning algorithms can heavily depend on the preprocessing techniques
applied to the text data.

Conclusion:
• The breast cancer prediction model demonstrates promising results in
accurately predicting breast cancer.
Future Work:
• Further improve the model's performance by fine-tuning the
parameters and optimizing the feature selection process.

