1. Introduction
Random forest is one of the most successful ensemble methods, with performance on a par
with boosting and support vector machines. The method is fast, robust to noise, resistant to
overfitting, and offers ways to interpret and visualize its output. We will study
options to increase the strength of individual trees in the forest or reduce their correlation. Using
several attribute evaluation measures instead of just one produces promising results.
In addition, replacing ordinary voting with voting weighted by the margin achieved on the most
similar instances can provide statistically significant improvements across multiple data sets.
Machine learning (ML) is becoming increasingly important, and with the rapid growth in the
volume and quality of medical data, it has become a key technology. However,
because healthcare data is complex, incomplete, and multi-dimensional, early and accurate detection
of diseases remains a challenge. Data preprocessing is an essential step in machine learning. The
primary purpose of preprocessing is to provide processed data that improves prediction
accuracy. This dissertation summarizes accessible data preprocessing steps based on usage,
popularity, and literature. The selected preprocessing method is then applied to the raw
data, and the classifier uses the result for prediction.
Data mining faces the challenge of finding systematic knowledge in large data streams
to support managerial decision-making. Although research in operations research, direct
marketing, and machine learning focuses on the analysis and design of data mining
algorithms, the interaction between data mining and the preceding phase of data
preprocessing has not been studied in detail. This paper considers the effects of various
preprocessing techniques (attribute scaling, sampling, categorical coding, and continuous attribute
coding) on the performance of decision trees, neural networks, and support vector machines.
2. Problem statement
We use machine learning to predict breast cancer cases from patient treatment
history and health data. We will use the Wisconsin breast cancer data set. Among
women, breast cancer is a leading cause of death. Breast cancer risk prediction can
inform screening and preventive measures.
Recent studies found that adding inputs to the widely used Gail model can
improve its ability to predict breast cancer risk. However, these models use
simple statistical architectures, and the additional inputs come from costly and invasive
procedures.
By contrast, we need to develop a machine learning model that uses personal
health data to predict breast cancer risk over a five-year horizon. Two models are needed:
one that uses only the Gail model inputs, and one that uses the Gail model
inputs together with other personal health data related to breast cancer risk.
The fundamental goals of cancer prediction are not the same as those of cancer
detection and diagnosis. Cancer prediction/prognosis is concerned with three
main purposes: 1) cancer susceptibility prediction (i.e., risk assessment), 2) cancer
recurrence prediction, and 3) cancer survival prediction. In the first case, one is
trying to predict the likelihood of developing a specific type of cancer before it occurs. In
the second case, one is trying to predict the likelihood of the cancer returning after the
disease has apparently resolved.
In the third case, one is trying to predict an outcome (life expectancy, survival,
progression, tumour drug sensitivity) after the disease is diagnosed. In the last two cases,
the success of the prognostic prediction depends to some extent on the success or quality of the
diagnosis. However, a disease prognosis can only be made after a clinical
diagnosis, and a prognostic prediction must take into account more than the diagnosis alone.
Through a multifactorial analysis of variance over various performance measures and
method parameters, it is possible to evaluate and provide empirical evidence that data
preprocessing significantly affects prediction accuracy, and that specific schemes
prove inferior to competing approaches. It is also found that: (1) the selected methods are
sensitive to the data representation as well as to method parameterization, which
shows the potential for improving performance through effective preprocessing; (2) the influence
of a preprocessing scheme depends on the method, so different "best practice" settings may be
needed to get the best results from a particular method; (3) the sensitivity of an algorithm to
preprocessing is therefore a necessary criterion for method evaluation and selection. In
predictive data mining, it needs to be considered alongside the traditional measures of
predictive power and computational efficiency.
To maximize prediction accuracy in data mining, machine learning research mainly
focuses on improving competing classifiers and on tuning algorithm parameters effectively.
This is usually tested in extensive benchmark experiments that use preprocessed data sets to
evaluate the impact on prediction accuracy and computational efficiency.
In contrast, while feature selection, resampling, and the discretization of continuous
attributes have been studied in detail, few publications examine how the coding of
categorical attributes and scaling affect predictions. More critically, in data mining,
especially in the medical field, the interaction between preprocessing choices and prediction
accuracy has not been analyzed in detail.
3.1. Preprocessing methods
This dissertation considers the three main standard preprocessing steps of NLP:
stemming, punctuation removal, and stop word removal. In stemming, we obtain the
stem form of each word in the data set, which is the part of the word to which affixes can be
attached.
Stemming algorithms are language-specific and differ in performance and accuracy. A
wide range of methods can be used, such as affix-removal stemming, n-gram stemming, and
table-lookup stemming. A critical NLP preprocessing step is to remove punctuation, which is
used to divide text into sentences, paragraphs, and phrases. Punctuation affects the results of
any text processing method, especially those that rely on the frequency of words and phrases,
because punctuation marks occur often in text.
Before any NLP processing, the most common terms, known as stop words, are removed. Stop
words are a group of frequently used words that carry little information on their own, such as
articles, determiners, and prepositions. By eliminating these words from the
text, we can focus on the important words.
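
The three steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pipeline used in this work: the stop-word list, the suffix list, and the function names are hypothetical, and the stemmer is a naive affix-removal variant chosen for brevity.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are", "to"}

# Hypothetical suffix list for a naive affix-removal stemmer (illustration only).
SUFFIXES = ("ing", "ed", "es", "s")


def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, remove punctuation, drop stop words, then stem each token."""
    tokens = remove_punctuation(text.lower()).split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The patients, screened in clinics, are reporting symptoms.")
print(tokens)  # ['patient', 'screen', 'clinic', 'report', 'symptom']
```

A production system would normally replace the naive stemmer with an established algorithm and use a curated stop-word list for the target language.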
Significance of using Random Forest?
Whether you have a regression task or a classification task, a random forest is a suitable
model. It can handle binary features, categorical features, and numerical
features. Hardly any preprocessing is required. The data does not need to be rescaled or transformed.
Random forests are parallelizable, which means we can split the training across several
processors. This can shorten the computation time. By contrast, boosted models are sequential and
take longer to compute. In Python, with scikit-learn, training can be run in parallel by passing
"n_jobs = -1" as a parameter; the value -1 means that every available processor is used.
Random forests also work well with high-dimensional data.
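
As a minimal sketch of the parallel setting mentioned above, the following example trains a forest with scikit-learn's `RandomForestClassifier` and `n_jobs=-1`. The use of the Wisconsin diagnostic data bundled with scikit-learn, the 75/25 split, the number of trees, and the random seeds are assumptions made for illustration, not settings reported in this work.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Wisconsin diagnostic breast cancer data bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Assumed split: 75% training, 25% held out, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# n_jobs=-1 asks scikit-learn to use all available processors
# when fitting the trees and when predicting.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the trees are fitted independently, the speed-up from `n_jobs=-1` grows with the number of trees and cores; for a data set this small the difference is barely noticeable.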
Each tree is faster to train than an ordinary decision tree, because only a subset of the
features is considered, so we can easily work with hundreds of features. Prediction is
significantly faster than training because the generated forest can be saved for later use. Random
forest deals with outliers by essentially binning them. It is also indifferent to nonlinear features.
It has methods for balancing error in data sets with unbalanced class populations. Random forest
tries to minimize the overall error rate. When the data set is unbalanced, the larger
class therefore gets a lower error rate and the smaller class a higher error rate.
Each decision tree has high variance but low bias. However,
because we average all the trees in the random forest, we also average the variance, so
we obtain a low-bias, moderate-variance model.
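
The effect of combining many trees on the spread of the predictions can be made concrete with a standard result for the average of correlated estimators. As an assumption for this sketch (the symbols are not from the text), let the forest contain $B$ trees whose predictions $T_b(x)$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as more trees are added, while the first term remains. This is why lowering the correlation between trees, as discussed in the introduction, matters as much as adding trees.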
As with any algorithm, there are advantages and disadvantages to using it. The
advantages and disadvantages of using the random forest for classification and regression
are as follows. The random forest algorithm does not depend on any single model, because it
contains many trees, and each tree is trained on a subset of the data.
The random forest algorithm relies on the strength of the "crowd". Therefore, the overall
bias of the algorithm is reduced. The algorithm is very stable. Even if
new data points are introduced into the data set, the overall algorithm is not affected
much, because new data may affect one tree but is unlikely to change all trees.
The random forest algorithm works well with both categorical and numerical features.
It can also work well when the data has missing values or has not been
scaled (although we have scaled the features in this article for demonstration
purposes only).
Drawbacks
Interpretability of the model: The random forest model is not easy to interpret. It is
similar to a black box. For large data sets, the trees can take up a lot of memory. The model
may overfit, so you should tune the hyperparameters. It has been observed that random
forests overfit some data sets with noisy classification/regression tasks. The algorithm is more
complex than the decision tree algorithm and requires more computation. Due to this
complexity, random forests require more training time than other comparable algorithms.
Materials and methods
Data
The model was trained and evaluated on the PLCO data set. This data set was generated
as part of a randomized, controlled, prospective study to determine the effectiveness of different
prostate, lung, colorectal, and ovarian cancer screenings. Participants enrolled in the study
and filled out a baseline questionnaire detailing their previous and current health status. All
processing of this data set was done in Python (version 3.6.7).
We initially downloaded the data of all women from the PLCO data set. The data set
consists of 78,215 women aged 50-78. We chose to exclude women who met any of the
following conditions:
1. Were missing data on whether they had been diagnosed with breast cancer and the time of
diagnosis
2. Were diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic but had no information about place of birth
5. Were missing data for any of the 13 selected predictors
We excluded women who had been diagnosed with breast cancer before the baseline
questionnaire because BCRAT is not valid for women with a personal history of breast
cancer.
BCRAT is also not valid for women who have received chest
radiotherapy, who carry BRCA1 or BRCA2 gene mutations, or who have lobular carcinoma in situ,
ductal carcinoma in situ, or other rare cancer-causing syndromes such as Li-Fraumeni
syndrome. Since there is no data for these conditions in the PLCO data set, we assume that these
conditions do not apply to any women in the data set. Since only the PLCO white, black, and
Hispanic race/ethnicity categories match the BCRAT implementation we used, we excluded
subjects based on self-identified race/ethnicity.
We excluded subjects who identified as Hispanic but had no data on
their place of birth because the BCRAT implementation uses different breast cancer composite
incidence rates for US-born and foreign-born Hispanic women. After removing subjects based on
these conditions, the number of women was reduced to 64,739.
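
The exclusion rules above amount to a record-level filter. The sketch below shows one way to express them in Python. The field names (`diagnosed`, `days_to_diagnosis`, `race`, `us_born`), the short predictor list standing in for the 13 predictors, and the convention that a negative `days_to_diagnosis` means "before the baseline questionnaire" are all hypothetical, since the text does not give the PLCO column names.

```python
# Hypothetical stand-in for the 13 selected predictors.
REQUIRED_PREDICTORS = ["age", "menarche_age", "bmi"]
ALLOWED_RACES = {"white", "black", "hispanic"}


def is_eligible(record):
    """Return True if the record passes all five exclusion criteria."""
    diagnosed = record.get("diagnosed")
    days = record.get("days_to_diagnosis")
    # 1. Diagnosis status (and its timing, if diagnosed) must be known.
    if diagnosed is None or (diagnosed and days is None):
        return False
    # 2. Diagnosed before baseline (assumed: negative days relative to baseline).
    if diagnosed and days < 0:
        return False
    # 3. Must self-identify as white, black, or Hispanic.
    if record.get("race") not in ALLOWED_RACES:
        return False
    # 4. Hispanic women need a known place of birth.
    if record.get("race") == "hispanic" and record.get("us_born") is None:
        return False
    # 5. No missing values among the selected predictors.
    return all(record.get(name) is not None for name in REQUIRED_PREDICTORS)


def make(**overrides):
    """Build an illustrative record from defaults plus overrides (made-up data)."""
    base = {"diagnosed": False, "days_to_diagnosis": None, "race": "white",
            "us_born": True, "age": 60, "menarche_age": 12.5, "bmi": 25.0}
    base.update(overrides)
    return base


records = [
    make(id=1),
    make(id=2, diagnosed=True, days_to_diagnosis=-30),
    make(id=3, race="hispanic", us_born=None),
    make(id=4, race="asian"),
    make(id=5, bmi=None),
    make(id=6, diagnosed=True, days_to_diagnosis=900, race="black"),
]
kept_ids = [r["id"] for r in records if is_eligible(r)]
print(kept_ids)  # [1, 6]
```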
We trained a set of machine learning models that were fed five of the usual seven
BCRAT inputs. These five inputs (age, age at menarche, age at first live birth, number of
first-degree relatives with breast cancer, and race/ethnicity) are the only traditional BCRAT
inputs in the PLCO data set. We compared these machine learning models with BCRAT fed the
same five inputs.
The inputs to our models with a broader set of predictors include the five BCRAT inputs and
eight additional factors. These additional predictors were selected based on their availability in the
PLCO data set and their correlation with breast cancer risk: menopausal age, an indicator
of current hormone use, years of hormone use, BMI, pack-years of smoking,
years of birth control use, number of live births, and an indicator of personal cancer history.
To facilitate the training and testing of the models, we made limited modifications to the
predictor variables. First, we assigned numeric values to the categorical variables. The PLCO data
set records age at menarche, age at first live birth, age at menopause, years of hormone use,
and years of birth control use as categorical variables. For example, the age at menarche
variable is coded as 1 for younger than 10 years, 2 for 10-11 years, 3 for 12-13 years, 4 for
14-15 years, and 5 for 16 years or older. For a value of a categorical variable that
represents a maximum age/number of years or less (for example, ten years old or younger), we set
the value of the variable to the maximum value (for example, ten years old).
For values that represent a range strictly less than a maximum value (for example, less
than ten years old), we set the variable's value equal to the upper limit of the range (for example,
ten years old). Similarly, for values representing a minimum age/number of years or above (for
example, 16 years old or older), we set it to the minimum value (for example, 16 years old). For
values that cover a closed range (for example, 12-13 years old), we set the variable's value to
the midpoint of the range (for example, 12.5 years old).
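
The conversion rules above can be sketched as a small lookup. The code-to-range table follows our reading of the age at menarche encoding described in the text (1: younger than 10, 2: 10-11, 3: 12-13, 4: 14-15, 5: 16 or older); the names `MENARCHE_CODE_TO_RANGE`, `range_to_value`, and `decode_menarche_age` are hypothetical.

```python
# Category code -> (lower bound, upper bound); None marks an open end.
MENARCHE_CODE_TO_RANGE = {
    1: (None, 10),  # younger than 10
    2: (10, 11),
    3: (12, 13),
    4: (14, 15),
    5: (16, None),  # 16 or older
}


def range_to_value(low, high):
    """Map a range to one number: the bound for open ranges, the midpoint otherwise."""
    if low is None:
        return float(high)  # "less than X" or "X or less" -> X
    if high is None:
        return float(low)   # "X or more" -> X
    return (low + high) / 2  # closed range -> midpoint


def decode_menarche_age(code):
    """Convert an age-at-menarche category code to a numeric age."""
    low, high = MENARCHE_CODE_TO_RANGE[code]
    return range_to_value(low, high)


decoded = {code: decode_menarche_age(code) for code in sorted(MENARCHE_CODE_TO_RANGE)}
print(decoded)  # {1: 10.0, 2: 10.5, 3: 12.5, 4: 14.5, 5: 16.0}
```

The same `range_to_value` helper would apply to the other categorical variables once their code-to-range tables are known.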
After modifying the categorical variables, we made some adjustments to the age at
first live birth and race/ethnicity variables fed into the models. For
the BCRAT model, we set the age at first live birth variable of women with no live births to 98 (as
the implementation of BCRAT we used, the "BCRA" software package (version 2.1) in R (version
3.4.3), instructs) and provided different race/ethnicity category values for foreign-born and
US-born Hispanic women. For the machine learning models, we set the age at first live
birth variable of women with no live births to their current age, and used two indicators to represent
race/ethnicity: one indicator for white women and one indicator for black women. Each woman is
classified as exactly one race/ethnicity (white, black, or Hispanic). Therefore, in addition to the
white and black indicators, we do not need a Hispanic indicator: a Hispanic woman's
white and black indicators are both 0. For the machine learning models, we did not
distinguish between Hispanic women born in the United States and Hispanic women born
abroad.
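
A minimal sketch of these two adjustments follows, reading "current" as the woman's current age. The function and field names are hypothetical, as is the convention that `None` marks a woman with no live births; only the value 98 and the two-indicator scheme come from the text.

```python
RACES = ("white", "black", "hispanic")


def age_first_birth_for_bcrat(age_first_birth):
    """BCRAT input: women with no live births (None) are coded as 98."""
    return 98 if age_first_birth is None else age_first_birth


def encode_for_ml(age, age_first_birth, race):
    """Machine learning input: two race indicators; no live births -> current age."""
    if race not in RACES:
        raise ValueError(f"unexpected race/ethnicity: {race!r}")
    return {
        "age": age,
        "age_first_birth": age if age_first_birth is None else age_first_birth,
        "is_white": int(race == "white"),
        "is_black": int(race == "black"),  # Hispanic women have both indicators at 0
    }


example = encode_for_ml(age=62, age_first_birth=None, race="hispanic")
print(example)
# {'age': 62, 'age_first_birth': 62, 'is_white': 0, 'is_black': 0}
```

Dropping the third indicator avoids a redundant column: since every woman belongs to exactly one of the three categories, the Hispanic indicator is fully determined by the other two.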