2. Int J Syst Assur Eng Manag
KL Kullback–Leibler
KNN K-nearest neighbor
LM Levenberg–Marquardt
LR Logistic regression
MCC Matthews correlation coefficient
MFWCP Mode fuzzy weight based canonical polyadic
MLP Multilayer perceptron
MLT Machine learning technique
MSD Mean squared deviation
NB Naive Bayes
NN Neural network
ReLU Rectified linear unit
RNN Recurrent neural network
SSAE-SM Stacked sparse autoencoder and softmax regression-based model
SVM Support vector machine
TFMF Trapezoidal fuzzy membership function
UCI University of California Irvine
WBCD Wisconsin breast cancer dataset
WDBC Wisconsin diagnostic breast cancer
WHO World Health Organization
1 Introduction
Human beings suffer from a multitude of diseases that affect
them both physically and psychologically. Diseases develop
primarily from infections or deficiencies or heredity traits or
organ dysfunctions. Doctors or medical experts detect and
diagnose disease-affected humans and include medical ther-
apy to treat them. Though many diseases can be cured with
therapies, HD and BC cannot be cured despite treatment, but
medications can prevent these diseases from getting worse
over a period of time. BC is a type of cancer that devel-
ops in breast cells and is a very common illness in women.
According to projections for India in 2020, the number of
patients might reach two million (Malvia et al. 2017). Men are more likely to be affected by HD than women. According to the WHO, HD is responsible for 24% of non-communicable disease deaths in India. Assessing HD manually from risk factors is a challenging task. Diagnosis is the process of determining, explaining, or establishing a person's condition based on their symptoms and signs.
Early and accurate diagnosis is critical because it affects
treatment efficiency and prevents long-term consequences
for the infected person. Diagnostic errors are responsible
for around 10% of patient deaths and a variety of severe
consequences and/or occurrences in hospitals (De and
Chakraborty 2020). Errors in diagnosis may result from
a variety of causes including mismatches in communica-
tions between doctors, patients, and their families or poor
diagnostic procedures, or inefficient information from
healthcare systems.
MLTs are sophisticated, automated techniques for analyz-
ing high-dimensional, multimodal biomedical data and can
greatly expedite and enhance medical diagnosis processes.
One way of using these techniques is predicting depend-
ent variables based on the values of independent variables.
MLTs once designed can repeat tasks with great accuracy,
a crucial factor for making decisions in healthcare. MLTs
which can classify accurately are important components of
CAD (Computer-Aided Diagnosis) systems designed for
assisting medical practitioners who use them in the early
diagnosis of abnormalities. CAD systems help radiologists
visually examine mammograms to reduce chances of mis-
diagnosis due to fatigue or eye strain or inexperience. Thus,
proper use of CAD systems in healthcare can undoubtedly
save lives. MLTs that have been used in CAD systems for
predictions of abnormalities include SVM, KNN, NN, DT,
and NB.
The presence of irrelevant features in training sets degrades classifier performance, which can be improved by discarding unnecessary features and choosing a subset of significant ones, a process called feature selection. Feature selection approaches can be supervised (Song et al. 2017; Solorio-Fernández et al. 2020), unsupervised (Sheikhpour et al. 2017), or semi-supervised (Xu et al. 2016; Ang et al. 2015), depending on how the training sets are labeled. Supervised feature selection techniques are grouped into filter, wrapper, and embedded methods. Filter methods score features independently of the classifier. Wrapper methods, on the other hand, use the prediction accuracy of a preset learning algorithm to measure the quality of the selected features. Embedded techniques, like filters, begin by using statistical criteria to select several candidate feature subsets with specific cardinalities; the subset with the best accuracy is then chosen for classification. Unsupervised feature selection can be applied to unlabeled data, but it is challenging to determine which features are relevant. Semi-supervised feature selection uses both labeled and unlabeled data to determine a feature's significance. Computational algorithms inspired by biological processes can aid in problem-solving and decision-making (Xue et al. 2015). Once features are selected, DLTs such as DNN, RNN, DBN, and CNN can be used to diagnose diseases. Using ensemble algorithms enhances disease prediction or classification accuracy (Rokach et al. 2014).
This paper proposes a diagnostic framework for clinics using bio-inspired algorithms for feature selection and EDL (Ensemble Deep Learning) for the classification of BC and HD. The obtained data is first separated into categorical and continuous subsets; discrete fields are imputed using a BN (Bayesian Network), and continuous subsets are imputed using MFWCP tensor decomposition. In addition, AMCGWO is used as a wrapper-based feature selection technique to select key features from the imputed data. Finally, an EDL-based classifier detects BC and HD. The proposed scheme's outcomes are evaluated using performance evaluation metrics. The remainder of this article is organized as follows: related works on ensemble classifiers for BC and HD are outlined in Sect. 2, and the proposed technique is explained in Sect. 3. Section 4 presents the experimental studies along with results and discussion. Section 5 concludes the study.
2 Literature review
Raza (Raza 2019) combined a variety of machine learning techniques, including Logistic Regression (LR), Naive Bayes (NB), and the Multilayer Perceptron (MLP), to predict cardiac disorders with an accuracy of 88.88%. Ensemble learning, which reaches a final decision by weighing and combining several separate classifiers, was used to increase the reliability of the classification. The output of an ensemble system is significantly influenced by how the classifiers' outputs are combined. The findings were compared with previously published studies and showed that the ensemble technique outperformed individual classifiers in terms of accuracy. The suggested ensemble method is intended to improve the model's capacity to predict cardiac illness accurately, robustly, and consistently, as well as to prevent misdiagnosis of patients.
A majority vote ensemble approach developed by Atallah
and Al-Mousa (Atallah and Al-Mousa 2019) predicted the
existence of HD in individuals. Their forecasts were based
on basic, low-cost medical test results conducted at clin-
ics. They trained their model using real-life data consisting
of healthy and sick individuals as they aimed to increase
the trust and accuracy for diagnosing clinicians. In order to
produce more accurate results, the study identified patients
using many MLT. Their strategy resulted in 90% accuracy
of detection while using the hard voting ensemble model.
Sapra et al. (Sapra et al. 2021) employed MLT in the diagnosis of HD. Their scheme was a quick recursive procedure that was low in cost and accurate. Patient data from clinics were the inputs to the scheme, which made predictions based on these low-cost clinical test results. Moreover, since the scheme was tested on the results of both patients and healthy individuals, its predictions were more trustworthy. The study also benchmarked several MLTs and found that the proposed strategy, which employed a hard voting ensemble model, achieved 90% accuracy.
Latha and Jeeva (Latha and Jeeva 2019) investigated
ensemble classifications by combining numerous classi-
fiers to increase the accuracy of weak algorithms. Their
experiments using the tool were executed on a heart
disease database dataset. The study employed comparative
analytical methods to see how ensembles could be used to
increase the prediction accuracy of HD. The use of ensem-
ble classification resulted in a maximum accuracy boost of
7.00% for weak classifiers. Implementing feature selection
boosted the process’s performance even further, with the
results indicating a significant rise in prediction accuracy.
In their data preparation, Baccouche et al. (Baccouche
et al. 2020) proposed a feature selection phase. Experi-
ments with unidirectional and bidirectional NNs revealed
that ensemble classifiers with Bi-LSTM model combined
with CNN achieved the most successful categorization
result for predicting various types of HD and with accu-
racy and F1-scores ranging from 91.00 to 96.00%. DLT
based ensemble-learning framework could address classifi-
cation issues of unbalanced heart disease datasets and their
proposed technique could result in exceptionally accurate
models suitable for clinical real-world diagnosis.
Kadam et al. (Kadam et al. 2019) used stacked sparse autoencoders and softmax regression (SSAE-SM) to categorize BC as benign or malignant. The suggested approach was tested on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset from the UCI machine learning repository in terms of classification accuracy, sensitivity, recall, precision, F-measure, and Matthews correlation coefficient (MCC). The method showed excellent reliability and efficiency. In experiments, SSAE-SM beat other classifiers, indicating that the approach may be beneficial for categorizing BC.
Elgin Christo et al. (Elgin Christo et al. 2019) described a clinical diagnostics system that used bio-inspired feature selection approaches with a gradient descent BPNN for classification. The study's correlation-based ensemble feature selection chose the best features from three feature subsets, and a gradient descent BPNN was then trained on them. The study used ten-fold cross-validation to train and assess the classifier, and classification accuracy was assessed on the Hepatitis and WDBC datasets from the UCI Machine Learning Repository.
Liu et al. (Liu et al. 2018) proposed using CNN to enhance classification accuracy on structured datasets. To improve structured data classification, they recommended FCLF-CNN (Fully Connected Layer First Convolutional Neural Network), in which fully connected layers are placed before the first convolutional layer and act as encoders or approximators that transform raw data into localized representations. To boost performance, the study trained four different types of FCLF-CNN and combined them to form an ensemble FCLF-CNN. The results on the WDBC and WBCD datasets were validated with five-fold cross-validation. In the classification results, the proposed FCLF-CNN outperformed MLP and CNN on both datasets.
Masud et al. (Masud et al. 2020) created shallow custom CNNs that outperformed pre-trained models across a wide range of performance metrics. To avoid bias, the study's model was trained using a fivefold cross-validation strategy. Furthermore, the model was simpler to train than pre-trained models as it required far fewer trainable parameters. Grad-CAM (Gradient-weighted Class Activation Mapping) heat map visualizations showed that the proposed framework could extract crucial features for diagnosing BC.
Recently, CAD systems have been developed to detect disease classes efficiently. However, single classifiers do not enhance system performance when features are irrelevant or datasets are incomplete. Ensemble learning combines numerous base classifiers to create a new classifier that outperforms any constituent classifier, while the issues of missing data and feature selection have been addressed by factorization and optimization methods.
3 Proposed methodology
The suggested strategy aims to improve breast cancer and heart disease prediction performance by employing an EDL classifier. The proposed approach comprises five stages: data splitting, data pre-processing, feature selection, model training, and performance evaluation. In the first stage, the data is split into discrete and continuous feature sets. In the second stage, missing values in the discrete set are imputed by the Bayesian network (BN), while reconstruction and imputation of the continuous set are performed by tensor factorization with MFWCP. In the third stage, feature selection by AMCGWO reduces the number of features in the dataset. In the fourth stage, the EDL classifier combines two classifiers, ECNN and DWBi-LSTM, to improve disease prediction accuracy. In the final stage, evaluation metrics such as precision, sensitivity, specificity, F-measure, and accuracy are used to assess the classifiers. Figure 1 depicts the proposed method architecture.
3.1 Imputation methods for incomplete dataset
When data values are missing, classifier accuracy suffers, so imputing values from the available data becomes essential. In this study, BN is used to estimate missing data because of its capacity to represent uncertainty through causal relationships among variables. The primary goal of this step is to impute incomplete datasets for enhanced predictions of BC. Values absent from the dataset instances are assumed to be Missing at Random (MAR). The dataset is used to learn the Directed Acyclic Graph (DAG) structure and its conditional probability distributions. EM (Expectation–Maximization) is a fast BN technique with only two steps that iteratively searches for maximum-likelihood estimates. The expectation step computes the expected log-likelihood of the data given the network's current structure and parameters (Franzin et al. 2017). The maximization step then identifies the parameters that maximize the likelihood from the previous step. The process is repeated until the network converges, that is, until the parameters no longer change. As a consequence, training on data with missing values succeeds.
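The EM loop described above can be illustrated on the smallest possible network. The following is a minimal sketch, not the authors' implementation: it assumes a fixed two-node structure A -> B with binary variables, missing values only in A, and at least one observed record of each A class. The function name `em_impute` and the initial parameter guesses are hypothetical.

```python
def em_impute(a, b, n_iter=100, tol=1e-9):
    """Fit the two-node network A -> B by EM and impute missing A values.

    a: list of 0/1 values, with None marking a missing entry.
    b: list of fully observed 0/1 values, same length as a.
    """
    p_a = 0.5          # P(A=1), initial guess (assumption)
    p_b = [0.4, 0.6]   # P(B=1 | A=0), P(B=1 | A=1), initial guess (assumption)

    def posterior(bi):
        # P(A=1 | B=bi) under the current parameters (Bayes' rule).
        like1 = p_a * (p_b[1] if bi == 1 else 1 - p_b[1])
        like0 = (1 - p_a) * (p_b[0] if bi == 1 else 1 - p_b[0])
        return like1 / (like1 + like0)

    for _ in range(n_iter):
        # E-step: expected value of A for every record.
        w = [posterior(bi) if ai is None else float(ai) for ai, bi in zip(a, b)]
        # M-step: re-estimate the parameters from the expected counts.
        n1 = sum(w)
        n0 = len(w) - n1
        new_p_a = n1 / len(w)
        new_p_b = [
            sum((1 - wi) * bi for wi, bi in zip(w, b)) / n0,
            sum(wi * bi for wi, bi in zip(w, b)) / n1,
        ]
        change = abs(new_p_a - p_a) + sum(abs(x - y) for x, y in zip(new_p_b, p_b))
        p_a, p_b = new_p_a, new_p_b
        if change < tol:  # parameters no longer change: converged
            break

    # Impute each missing A with its most probable value.
    imputed = [(1 if posterior(bi) >= 0.5 else 0) if ai is None else ai
               for ai, bi in zip(a, b)]
    return imputed, p_a, p_b


a = [1, 1, 1, 0, 0, 0, None, None]   # None marks a missing discrete value
b = [1, 1, 0, 0, 0, 1, 1, 0]
imputed, p_a, p_b = em_impute(a, b)
print(imputed, p_a, p_b)
```

A real Bayesian network would have more nodes and a learned structure; the sketch only shows how the expectation and maximization steps alternate until the parameters stabilize.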
3.2 Reconstruction and imputation using tensor via MFWCP
Tensors are arrays that extend across several dimensions, and the order of a tensor is the number of dimensions it contains. Tensor factorizations yield higher accuracy, but they take much more time to compute. The Tucker and CP tensor factorization models are well known (Yang et al. 2017). The main objective of MFWCP factorization is to minimize the error made when reconstructing the original tensor, that is, to find the sum of rank-one tensors with the smallest deviation from the original tensor. TFMFs (Trapezoidal Fuzzy Membership Functions) are used to construct non-negative fuzzy weight tensors whose entries are fuzzy membership values. The weight tensor has the same size as the original tensor and is used to account for missing data during imputation. Reconstruction results are measured via Mean Squared Deviation (MSD), calculated using Eq. (1) (Vazifehdan et al. 2019), where Output_i(t) is the estimated value of the ith missing entry in the tth iteration and n is the count of missing values.
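Two ingredients of this step are simple enough to sketch directly: the trapezoidal membership function that produces fuzzy weights, and the MSD of Eq. (1) used as a stopping check. The sketch below assumes the standard trapezoid with corner points a < b <= c < d; the corner values, the threshold, and the `estimates` list are hypothetical, and the tensor factorization itself is not shown.

```python
def trapezoidal_membership(x, a, b, c, d):
    """Trapezoidal fuzzy membership with feet a, d and shoulders b, c."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    if x <= c:
        return 1.0                 # plateau
    return (d - x) / (d - c)       # falling edge


def msd(current, previous):
    """Eq. (1): mean squared deviation between estimates at iterations t and t-1."""
    n = len(current)
    return sum((c - p) ** 2 for c, p in zip(current, previous)) / n


# Hypothetical successive estimates of two missing entries.
estimates = [[0.0, 0.0], [0.8, 1.6], [0.98, 1.96], [0.999, 1.998]]
threshold = 1e-3  # assumed stopping threshold
for t in range(1, len(estimates)):
    deviation = msd(estimates[t], estimates[t - 1])
    print(t, deviation)
    if deviation < threshold:
        break  # the imputed values have stabilized
```

The membership values are non-negative by construction, which matches the requirement that the fuzzy weight tensor be non-negative.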
3.3 Feature selection via AMCGWO
In this method, AMCGWO is used to choose features, and EDL is used to determine the optimum feature set. AMCGWO imitates the hunting and prey-searching behaviours of grey wolves to extract the optimum features from the dataset. AMCGWO follows the grey wolf social hierarchy, which ranks 𝛼 wolves first, 𝛽 second, 𝛿 third, and 𝜔 last. The 𝛼 wolves are the dominant wolves that lead and control the entire pack in order to choose the most desirable features. The 𝛽 wolf is the leading candidate to become 𝛼; it receives feedback from other wolves and passes it to the lead wolf. The next level, the 𝛿 wolves, control the 𝜔 wolves, and the final level, the 𝜔 wolves, are responsible
$$\mathrm{MSD} = \frac{1}{n} \sum_{i=1}^{n} \left( \mathrm{Output}_i(t) - \mathrm{Output}_i(t-1) \right)^2 \tag{1}$$
for preserving the consistency and safety of the wolf group (Faris et al. 2018). The ranges of the method's controlling parameters a, A, and C are assessed first. The random vectors r1 and r2, drawn from [0, 1], allow the wolves to reach any position between themselves and the prey. Here, the mean is used to calculate the vector value: the random vectors are increased if the mean value is more important for classification and decreased otherwise. Although the convergence rate of GWO is high, it does not work well in identifying global optima, which in turn affects the algorithm's convergence. To reduce this effect and enhance efficiency, the AMCGWO method was built by introducing chaos into the GWO method. For the chaotic maps, the initial value lies between 0 and 1. Nevertheless, the initial value can significantly change the pattern of a chaotic map. The chaotic map is chosen from a collection of maps with different behaviours, with the starting value set at 0.7. First, the population of grey wolves is initialized stochastically. Next, the selected ICMIC map is incorporated into the approach when the initial chaotic value and the variables are initialized (Gandomi and Yang 2014). The AMCGWO approach's parameters a, A, and C are specified as in CGWO so that they can be employed in exploitation and exploration. The fitness of every grey wolf is evaluated using the benchmark function, and the wolves are then ranked according to their fitness values. The fittest wolf at the last iteration is the final outcome of the AMCGWO procedure.

Fig. 1 Proposed Feature Selection and Ensemble Deep Learning (EDL) Classifier for Disease Diagnosis
[Figure: flowchart with the blocks Incomplete data; Dataset with discrete missing values and Dataset with continuous missing values; Missing data imputation by Bayesian network and Reconstruction and imputation using tensor factorization with MFWCP; Mean Squared Deviation (MSD) check (Yes/No); Optimal filled dataset; AMCGWO; Ensemble Deep Learning (EDL) with ECNN and DWBi-LSTM; Bootstrap aggregation; Performance analysis]
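The wrapper search can be sketched with the standard grey wolf update, in which each wolf moves towards the average of the positions suggested by the 𝛼, 𝛽, and 𝛿 leaders. The code below is a simplified illustration under several assumptions, not the AMCGWO algorithm itself: the logistic map stands in for the ICMIC map, the adaptive-mean adjustment is omitted, a nearest-centroid classifier replaces the EDL as the wrapper's fitness, and `make_toy_data` generates hypothetical data. Only the starting chaotic value of 0.7 is taken from the text.

```python
import random


def logistic_map(x):
    # Chaotic logistic map on (0, 1); a stand-in for the ICMIC map named in the text.
    return 4.0 * x * (1.0 - x)


def make_toy_data(n=200, seed=1):
    """Hypothetical toy data: features 0-1 are informative, features 2-5 are noise."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(3.0 * y, 1.0) for _ in range(2)]
        x += [rng.gauss(0.0, 1.0) for _ in range(4)]
        data.append((x, y))
    return data


def centroid_accuracy(mask, train, valid):
    """Wrapper fitness: nearest-centroid accuracy using only the selected features."""
    idx = [k for k, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[k] for r in rows) / len(rows) for k in idx]
    correct = 0
    for x, y in valid:
        dist = {lab: sum((x[k] - c) ** 2 for k, c in zip(idx, cen))
                for lab, cen in centroids.items()}
        correct += int(min(dist, key=dist.get) == y)
    return correct / len(valid)


def chaotic_gwo_select(train, valid, n_features, n_wolves=8, n_iter=20, seed=0):
    rng = random.Random(seed)
    chaos = 0.7  # starting chaotic value, as stated in the text
    wolves = [[rng.random() for _ in range(n_features)] for _ in range(n_wolves)]
    best_mask, best_fit = None, -1.0
    for t in range(n_iter):
        # Rank the pack by fitness; a feature is selected when its coordinate > 0.5.
        scored = []
        for w in wolves:
            mask = [v > 0.5 for v in w]
            scored.append((centroid_accuracy(mask, train, valid), w, mask))
        scored.sort(key=lambda s: s[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best_mask = scored[0][0], scored[0][2]
        leaders = [s[1] for s in scored[:3]]  # alpha, beta, delta
        a = 2.0 - 2.0 * t / n_iter            # decreases linearly from 2 towards 0
        new_wolves = []
        for w in wolves:
            pos = []
            for k in range(n_features):
                total = 0.0
                for lead in leaders:
                    chaos = logistic_map(chaos)
                    r1 = chaos                 # chaotic value replaces a uniform draw
                    chaos = logistic_map(chaos)
                    r2 = chaos
                    A = 2.0 * a * r1 - a
                    C = 2.0 * r2
                    D = abs(C * lead[k] - w[k])
                    total += lead[k] - A * D
                pos.append(min(1.0, max(0.0, total / 3.0)))  # keep inside [0, 1]
            new_wolves.append(pos)
        wolves = new_wolves
    return best_mask, best_fit


data = make_toy_data()
mask, fit = chaotic_gwo_select(data[:140], data[140:], n_features=6)
print(mask, fit)
```

Thresholding continuous positions at 0.5 is one common way to turn GWO into a binary feature selector; the paper does not state which binarization it uses.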
3.4 Ensemble deep learning (EDL) classification
Ensemble learning is a method that can be used to increase a classifier's accuracy. It is a useful meta-classification strategy that pairs weak learners with strong learners to increase the effectiveness of the weak learners. In this work, the ensemble deep learning (EDL) technique is used to improve the performance of several disease detection algorithms. The goal of integrating several classifiers is to achieve greater performance than an individual classifier. Figure 2 shows the ensemble deep learning (EDL) process. In this work, two classifiers, ECNN and DWBi-LSTM, are combined via bootstrap aggregation (Ren et al. 2017).
3.4.1 Entropy convolutional neural network (ECNN)
CNN is modeled as an FFNN (Feed-Forward Neural Network) with convolutional, max-pooling, and fully connected layers. Convolutional layers come first, followed by max-pooling layers, and the fully connected layer serves as the final layer (Bashir et al. 2015).
3.4.1.1 Convolutional layers In convolutional layers, the weights are represented as the multiplicative factors of the filters. In the proposed work, the weights of the convolutional layers are computed via the entropy function. Entropy is used to compute the weight value of the layer by considering the range of each feature with respect to the classes. If the feature range is higher for the positive class, then the entropy is also higher, which results in an increased weight value and a reduced bias value. If the feature range is lower for the positive class, then the entropy is lower, which results in a reduced weight value and a reduced bias value. On this basis, the classifier results are enhanced for disease diagnosis. Let v_i ∈ ℝ^k be the k-dimensional feature vector of the ith sample of the dataset. The extended dataset is given by Eq. (2), where ⊕ represents the concatenation operator. The convolution operation uses a filter w ∈ ℝ^{hk}, which is applied to a window of h features to produce a new feature. A feature c_i is generated from a window of features v_{i:i+h−1} by Eq. (3), where b ∈ ℝ is a bias term and f is a non-linear function such as the hyperbolic tangent. The filter is applied to every feature window in the dataset {v_{1:h}, v_{2:h+1}, ..., v_{n−h+1:n}} to build a feature map by Eq. (4), with c ∈ ℝ^{n−h+1}. Max pooling is then performed on the feature map, and the maximum value ĉ = max{c} is taken as the feature related to this filter. The goal is to capture the most significant feature, the one with the highest value, for every feature map.
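The window-and-pool computation of Eqs. (2)–(4) can be sketched in a few lines. This is an illustration under assumptions: f is taken to be tanh, the filter is stored as a flat list of length h·k, and the filter weights in the usage example are arbitrary rather than entropy-derived.

```python
import math


def conv_feature_map(v, w, b, h):
    """Eqs. (2)-(4): slide one filter over windows of h consecutive feature vectors.

    v: list of n feature vectors, each of length k.
    w: flat filter of length h * k.
    b: scalar bias.
    """
    n = len(v)
    feature_map = []
    for i in range(n - h + 1):
        # v_{i:i+h-1}: concatenation of h consecutive vectors (Eq. 2).
        window = [x for vec in v[i:i + h] for x in vec]
        # c_i = f(w . v_{i:i+h-1} + b) with f = tanh (Eq. 3).
        c_i = math.tanh(sum(wj * xj for wj, xj in zip(w, window)) + b)
        feature_map.append(c_i)
    return feature_map  # c = [c_1, ..., c_{n-h+1}] (Eq. 4)


def max_pool(feature_map):
    """Keep the largest activation as the feature associated with this filter."""
    return max(feature_map)


v = [[1.0], [2.0], [3.0], [4.0]]          # n = 4 samples, k = 1
c = conv_feature_map(v, w=[1.0, 0.0], b=0.0, h=2)
print(c, max_pool(c))
```

With n = 4 and h = 2 the feature map has n − h + 1 = 3 entries, matching the dimension stated for c.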
3.4.1.2 Dropout layer For regularization, dropout is employed together with a constraint on the l2-norms of the weight vectors. In forward propagation, the output y for input z is normally given by Eq. (5), whereas dropout uses Eq. (6), where ◦ represents the element-wise multiplication operator and r ∈ ℝ^m is a masking vector of Bernoulli random variables, each equal to 1 with probability p. Gradients are backpropagated only through the unmasked units. At test time, the learnt weight vectors are scaled by p so that ŵ = pw, and ŵ is used to score unseen samples. In addition, the l2-norms of the weight vectors are constrained by rescaling w to ||w||2 = s whenever ||w||2 > s after a gradient descent step. In Eq. (6), w and b denote the weight and bias of the classifier, which are calculated
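$$v_{1:n} = v_1 \oplus v_2 \oplus \dots \oplus v_n \tag{2}$$

$$c_i = f\left( w \cdot v_{i:i+h-1} + b \right) \tag{3}$$

$$c = \left[ c_1, \dots, c_{n-h+1} \right] \tag{4}$$

$$y = w \cdot z + b \tag{5}$$

$$y = w \cdot (z \circ r) + b \tag{6}$$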
Fig. 2 Ensemble deep learning (EDL) process
[Figure: diagram with the blocks Feature selection; Training set 1 and Training set 2; Classifier 1 and Classifier 2; Ensemble classifier; Combined results using averaging; Test set; Prediction in the test set]
via the entropy based on feature importance. The information entropy is calculated by Eq. (7), where P(x = k) is the probability that a given feature takes the value k. If the entropy value is higher, then the weight and bias of the vector or the samples are increased. This gives the feature more importance when the classifier predicts the positive or negative class. If the entropy belonging to the positive class is higher, then the w and b of the classifier are increased to improve the prediction rate.
3.4.1.3 Softmax layer or fully connected layer ReLU is employed as the activation function. The ReLU definition is given in Eq. (8): when x < 0, the output is 0; when x ≥ 0, the output is x.
3.4.1.4 Output layer The final layer contains n neurons corresponding to the n classes. This is a fully connected layer. The common approach in classification is to take the neuron with the highest output as the class label of the given input (Sainath et al. 2013).
3.4.2 Divergence weight bidirectional long short-term memory
The DWBi-LSTM classifier is used to diagnose various diseases in this work. LSTM is a specialized RNN architecture designed to learn long-term dependencies (Sahoo et al. 2020). The LSTM cell works with the previous cell output state (C_{t−1}), the cell input state (C̃_t), and the cell output state (C_t). For classification of various diseases, the LSTM architecture comprises three gates: forget, input, and output, denoted f_t, i_t, and o_t respectively. The DWBi-LSTM classifier weight is computed via the Kullback–Leibler (KL) divergence function. If the feature range is wide, the KL value increases, resulting in a larger weight value. Otherwise, the classifier's weight value is lowered for classification. This improves the classifier's accuracy and lowers the system's error. The cell state serves as the network's memory, carrying important information along the sequence. The gates, which are themselves NNs, determine which information is allowed into the cell state. When training on the HD and BC datasets, the gates learn which information is important to keep and which to discard. Equations (9–12) may be used to calculate the values of the gates and the cell state,
$$\mathrm{Entropy}(x) = -\sum_{k} P(x = k) \log_2 P(x = k) \tag{7}$$

$$f(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \ge 0 \end{cases} \tag{8}$$
where DW_f, DW_i, DW_o, and DW_c are the divergence weights that connect the layer input to the three gates and the input cell state; U_f, U_i, U_o, and U_c are the weight matrices that connect the previous cell output to the three gates and the input cell state; and b_f, b_i, b_o, and b_c are bias vectors. σ and tanh are, respectively, the sigmoid and hyperbolic tangent activation functions. The cell output state (C_t) and the layer output (h_t) at every time step t are computed by Eqs. (13)–(14). The final output of an LSTM layer is the vector of all outputs, Y_T = [h_{T−n}, …, h_{T−1}]. Bidirectional LSTM is based on the bidirectional RNN (Houdt et al. 2020). It processes sequential data in both the forward and backward directions using two separate hidden layers, and both layers are connected to the same output layer. Figure 3 shows a layered architecture for the Bi-LSTM network.
The first layer is the sequence input layer, also known as the embedding layer. As input, it uses the sorted selected features from the HD and BC datasets. The hidden forward and backward LSTM layers are the second and third layers, which together form the Bi-LSTM layer with 100 hidden units. These two layers relate the present data to past and future steps. The two data sequences pass through the hidden layers, whose outputs are combined after processing to create the Bi-LSTM layer's final output. Eq. (15) may be used to calculate the output from both LSTM layers, where h_t^f and h_t^b denote the outputs of the forward and backward LSTM layers, respectively, when the sequence from x_1 to x_T is taken as input, and 𝛼 and 𝛽 are factors that weight the two directions. h_t is the combined output of the bidirectional LSTM at time t. The Bi-LSTM output feeds a fully connected layer with five classes. This layer links the input feature data to the output information so that subsequent layers can classify it. Finally, the softmax and classification layers divide the data into several classes.
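$$f_t = \sigma\left( DW_f x_t + U_f h_{t-1} + b_f \right) \tag{9}$$

$$i_t = \sigma\left( DW_i x_t + U_i h_{t-1} + b_i \right) \tag{10}$$

$$o_t = \sigma\left( DW_o x_t + U_o h_{t-1} + b_o \right) \tag{11}$$

$$\tilde{C}_t = \tanh\left( DW_c x_t + U_c h_{t-1} + b_c \right) \tag{12}$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{13}$$

$$h_t = o_t * \tanh\left( C_t \right) \tag{14}$$

$$h_t = \alpha h_t^{f} + \beta h_t^{b} \tag{15}$$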
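Equations (9)–(15) translate almost line by line into code. The following sketch assumes plain Python lists for vectors and matrices, zero initial states, and a `params` dictionary holding a (DW, U, b) triple per gate; these names and the tiny parameter values in the usage example are hypothetical, and the KL-based computation of the DW matrices is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gate(DW, U, b, x, h_prev, act):
    """act(DW x_t + U h_{t-1} + b), the common form of Eqs. (9)-(12)."""
    pre = [p + q + r for p, q, r in zip(matvec(DW, x), matvec(U, h_prev), b)]
    return [act(v) for v in pre]


def lstm_step(x, h_prev, c_prev, params):
    f = gate(*params["f"], x, h_prev, sigmoid)            # Eq. (9)
    i = gate(*params["i"], x, h_prev, sigmoid)            # Eq. (10)
    o = gate(*params["o"], x, h_prev, sigmoid)            # Eq. (11)
    c_tilde = gate(*params["c"], x, h_prev, math.tanh)    # Eq. (12)
    c = [fi * cp + ii * ct
         for fi, cp, ii, ct in zip(f, c_prev, i, c_tilde)]  # Eq. (13)
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]        # Eq. (14)
    return h, c


def run_lstm(xs, params, hidden):
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


def bilstm(xs, fwd, bwd, hidden, alpha=0.5, beta=0.5):
    """Eq. (15): h_t = alpha * h_t^f + beta * h_t^b."""
    hf = run_lstm(xs, fwd, hidden)
    hb = run_lstm(xs[::-1], bwd, hidden)[::-1]  # backward pass, re-aligned to t
    return [[alpha * a + beta * b for a, b in zip(f, g)] for f, g in zip(hf, hb)]


zero = ([[0.0]], [[0.0]], [0.0])
params = {"f": zero, "i": zero, "o": zero, "c": ([[0.0]], [[0.0]], [1.0])}
print(bilstm([[1.0], [2.0], [3.0]], params, params, hidden=1))
```

The values of alpha and beta are set to 0.5 here only for illustration; the text does not state how they are chosen.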
The softmax activation transforms a vector of real values into values in the range 0 to 1, so that they can be interpreted as probabilities. In softmax regression (Wang et al. 2018), the probability of classifying into a class may be calculated using Eq. (16), where K is the total number of classes and 𝜃 denotes the model parameters. The classification layer takes the results of the softmax function and assigns every input to a class using the cross-entropy function via Eq. (17), where N is the number of observations, K is the number of classes, t_ij indicates that the ith sample belongs to the jth class, and y_ij denotes the softmax value.
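Equations (16) and (17) can be sketched as below. The sketch assumes one parameter vector θ(k) per class and one-hot targets t_ij; subtracting the maximum score before exponentiating is a standard numerical safeguard that is not mentioned in the text, and the function names are hypothetical.

```python
import math


def class_scores(theta, x):
    """theta: list of K parameter vectors; returns theta^(k)T x for each class k."""
    return [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in theta]


def softmax(scores):
    """Eq. (16): exp(score_k) / sum_j exp(score_j), computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(targets, probs):
    """Eq. (17): loss = -sum_i sum_j t_ij * ln(y_ij); terms with t_ij = 0 vanish."""
    return -sum(t * math.log(y)
                for t_row, y_row in zip(targets, probs)
                for t, y in zip(t_row, y_row) if t > 0)


theta = [[1.0, 0.0], [0.0, 1.0]]          # K = 2 classes, 2 features
y = softmax(class_scores(theta, [2.0, 1.0]))
print(y, cross_entropy([[1, 0]], [y]))
```

The predicted class is the index of the largest softmax value, and the loss is small when that value is close to 1 for the true class.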
Weighting features is essential in classification because features differ in their importance with respect to the target concept. Assume that when a given feature value is observed, it provides a certain amount of information about the target feature. The amount of information contained in a particular feature value is defined as the discrepancy between the prior and posterior distributions of the target feature. The weight is computed using the Kullback–Leibler (KL) measure, given by Eq. (18), where fr_ij is the jth value of the ith feature in the training samples. The weighted average of the KL measurements across the
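$$P\left( y^{(i)} = k \mid x; \theta \right) = \frac{\exp\left( \theta^{(k)T} x \right)}{\sum_{j=1}^{K} \exp\left( \theta^{(j)T} x \right)} \tag{16}$$

$$\mathrm{loss} = -\sum_{i=1}^{N} \sum_{j=1}^{K} t_{ij} \ln y_{ij} \tag{17}$$

$$KL\left( C \mid fr_{ij} \right) = \sum_{c} P\left( c \mid fr_{ij} \right) \log\left( \frac{P\left( c \mid fr_{ij} \right)}{P(c)} \right) \tag{18}$$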
feature values is the feature weight. As a result, the weight of feature i, denoted fw_avg(i), is given by Eq. (19), where P(fr_ij) is the probability that feature i takes the value fr_ij. The weight fw_avg(i) favors features that have a large number of distinct values, for which the number of records associated with each feature value is too small for reliable learning. Equation (20) therefore defines the final form of the weight of feature i, denoted fw(i), where Z is a normalization constant computed by Eq. (21) and n is the number of features selected from the training data. The normalized form of fw(i) in Eq. (20) is used in this work to ensure that ∑_i fw(i) = n. Lastly, each gate in the Bi-LSTM classifier is updated with this weight value. The network's hyperparameters are initialized once the network has been defined. Hyperparameters are the settings on which the whole training process is based, and they can be model-specific or optimization-specific. Optimization-specific hyperparameters include epoch counts, batch sizes, and learning rates, which affect performance considerably when tuned. Model-specific hyperparameters are elements that affect the structure, such as the number of hidden units or layers. These hyperparameters directly control training
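$$fw_{avg}(i) = \sum_{j|i} P\left( fr_{ij} \right) \cdot KL\left( C \mid fr_{ij} \right) \tag{19}$$

$$fw(i) = \frac{\sum_{j|i} P\left( fr_{ij} \right) \sum_{c} P\left( c \mid fr_{ij} \right) \log\left( \frac{P\left( c \mid fr_{ij} \right)}{P(c)} \right)}{-Z \cdot \sum_{j|i} P\left( fr_{ij} \right) \log\left( P\left( fr_{ij} \right) \right)} \tag{20}$$

$$Z = \frac{1}{n} \sum_{i} fw(i) \tag{21}$$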
Fig. 3 DWBi-LSTM Network
[Figure: network diagram with the blocks Input data; Sorting data; Padded input data; Embedding layer; forward and backward LSTM layers with hidden outputs h1, h2, h3, …; Fully connected layer; Softmax layer; Classification layer; Classes]
processes, which have a significant influence on the resulting performance of models. Hence, it is imperative to select the right hyperparameters for a model's learning. A huge number of trials is required to select optimal hyperparameters, which can be time-consuming and make the process difficult. Classification accuracies on the test sets are evaluated to obtain appropriate hyperparameter sets. Hence, this research work examines the performance of hyperparameters on both training and validation datasets to select the right mix in terms of classification results. This study considers learning rates, batch sizes, epoch counts, and hidden unit counts as its hyperparameters.
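Returning to the feature weighting of Eqs. (18) and (19), the computation can be sketched for discrete features by estimating all probabilities from counts. This is a simplified illustration under assumptions: the natural logarithm is used (the text does not state the base), classes with zero posterior probability contribute nothing, and the split-information denominator of Eq. (20) is omitted, so `normalize_weights` only rescales the weights to sum to n. The function names are hypothetical.

```python
import math
from collections import Counter


def kl_feature_weight(feature_values, labels):
    """Eq. (19): fw_avg(i) = sum_j P(fr_ij) * KL(C | fr_ij), with KL from Eq. (18)."""
    n = len(labels)
    prior = {c: k / n for c, k in Counter(labels).items()}            # P(c)
    weight = 0.0
    for value, count in Counter(feature_values).items():
        # Class counts among the records where the feature takes this value.
        cls = Counter(l for v, l in zip(feature_values, labels) if v == value)
        kl = sum((k / count) * math.log((k / count) / prior[c])
                 for c, k in cls.items())                              # Eq. (18)
        weight += (count / n) * kl                                     # P(fr_ij) * KL
    return weight


def normalize_weights(weights):
    """Rescale the weights so that they sum to the number of features n."""
    n, total = len(weights), sum(weights)
    return [w * n / total for w in weights]


labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]     # determines the class exactly
uninformative = [0, 1, 0, 1]   # independent of the class
weights = [kl_feature_weight(informative, labels),
           kl_feature_weight(uninformative, labels)]
print(weights)
```

A feature that leaves the class distribution unchanged gets weight 0, while a feature that determines the class gets the largest weight, which is the behaviour the weighting scheme relies on.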
3.4.3 Bootstrap aggregation
Bootstrap aggregation (bagging) draws samples from the training set
at random with replacement. Each resampled training set is called a
"bootstrap replicate". Bagging is the process of obtaining bootstrap
samples from the data and training a classifier on each individual
sample. The votes of the classifiers are then tallied, and the final
classification is the class that receives the most votes. The
majority voting classifier is a meta-classifier, so any classifier
can be combined under it. The final class label is the one predicted
by the majority of the classifiers. The class label $d_J$ is
represented as $d_J = \mathrm{mode}\{C_1, C_2, \ldots, C_n\}$, where
$\{C_1, C_2, \ldots, C_n\}$ refers to the individual classifiers that
participate in the voting. Let $c_{i,j}$ be the prediction of the
$i$th classifier for class $j$ of the $m$ class labels; the selected
class $J$ satisfies

$$\sum_{i=1}^{n} c_{i,J} = \max_{j=1,\ldots,m} \sum_{i=1}^{n} c_{i,j}$$

The algorithm for bootstrap aggregation is shown below.
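As a rough illustration of bagging with majority voting, the following sketch trains several base learners on bootstrap replicates and returns the mode of their predictions. It is not the authors' algorithm: the nearest-centroid base learner is a hypothetical stand-in for ECNN and DWBi-LSTM, and the function names, the number of estimators, and the toy data are all assumptions.

```python
import random
from collections import Counter


def bootstrap_replicate(samples, labels, rng):
    """Draw len(samples) items at random WITH replacement."""
    indices = [rng.randrange(len(samples)) for _ in range(len(samples))]
    return [samples[i] for i in indices], [labels[i] for i in indices]


def fit_centroids(samples, labels):
    """Hypothetical stand-in base learner: one mean vector per class."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


def majority_vote(predictions):
    """d_J = mode{C_1, ..., C_n}: the label predicted most often."""
    return Counter(predictions).most_common(1)[0][0]


def bagging_fit(samples, labels, n_estimators=11, seed=0):
    """Train one base learner per bootstrap replicate."""
    rng = random.Random(seed)
    return [fit_centroids(*bootstrap_replicate(samples, labels, rng))
            for _ in range(n_estimators)]


def bagging_predict(models, x):
    """Tally the votes of all base learners and return the majority."""
    return majority_vote([predict_centroid(m, x) for m in models])


# Toy one-dimensional data with two well-separated classes.
samples = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
labels = [0, 0, 0, 1, 1, 1]
models = bagging_fit(samples, labels)
print(bagging_predict(models, (0.05,)), bagging_predict(models, (5.05,)))
```

An odd number of estimators is used here so that a two-class vote cannot tie; that choice is a convenience of the sketch, not something the text prescribes.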
4 Experiments and results discussion
The influence of the imputation methods in this study was measured
by how well the resulting predictions turned out. The experiments
assess and examine performance on the real datasets described in
detail below.
4.1 Datasets
This study uses the WDBC and WOBC datasets from the UCI Machine
Learning Repository for BC, and the Hungarian and Switzerland
datasets from the same repository for HD.
4.1.1 WDBC
In the WDBC dataset, the features are computed from a digitized
image of a fine needle aspirate (FNA) of a breast mass and describe
the cell nuclei present in the image. There are 569 data points in
the dataset, with 212 belonging to the malignant class and 357 to
the benign class. The features are organized into 10 groups, one per
characteristic: fractal dimension, radius, symmetry, texture,
concave points, perimeter, concavity, area, compactness, and
smoothness. Three measures, namely the mean, the standard error
(SE), and the mean of the three largest values, are calculated for
each characteristic. As a result, the dataset has a total of 30
features.
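The count of 30 features is the product of 10 characteristics and 3 measures, as the short sketch below shows. The identifier names (for example `radius_mean`) and the label `worst` for the mean of the three largest values are illustrative assumptions, not column names taken from the paper.

```python
from itertools import product

# The ten nucleus characteristics listed in the text.
CHARACTERISTICS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# The three measures per characteristic; "worst" is an assumed label
# for the mean of the three largest values.
MEASURES = ["mean", "se", "worst"]

# Hypothetical feature names: one per (measure, characteristic) pair.
feature_names = [f"{c}_{m}" for m, c in product(MEASURES, CHARACTERISTICS)]
print(len(feature_names))
```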
4.1.2 WOBC
The dataset contains 699 samples obtained from the UCI repository,
of which 458 are benign and 241 are malignant. It has ten attributes
and one class, and the class has two levels: benign and malignant.
The dataset also has missing values. The attributes comprise the
sample code number (id) and nine features, each scored from 1 to 10:
clump thickness, uniformity of cell size, uniformity of cell shape,
marginal adhesion, single epithelial cell size, bare nuclei, bland
chromatin, normal nucleoli, and mitoses. The class is coded 2 for
benign and 4 for malignant.
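A minimal sketch of how one record of this dataset might be read is given below. It assumes a comma-separated layout (id, nine features, class) and assumes that missing values are marked with "?"; neither detail is stated in the text. The function name `parse_wobc_row` and the two example rows are illustrative only.

```python
def parse_wobc_row(line):
    """Parse one record: id, nine features scored 1-10, class (2 or 4).

    Assumptions: comma-separated fields and "?" for a missing value.
    Missing features become None so an imputation step can fill them.
    """
    fields = line.strip().split(",")
    sample_id = fields[0]
    features = [None if f == "?" else int(f) for f in fields[1:10]]
    label = {"2": 0, "4": 1}[fields[10]]  # 0 = benign, 1 = malignant
    return sample_id, features, label


# Illustrative rows in the assumed layout.
complete = parse_wobc_row("1000025,5,1,1,1,2,1,3,1,1,2")
with_gap = parse_wobc_row("1057013,8,4,5,1,2,?,7,3,1,4")
print(complete)
print(with_gap)
```

Keeping missing entries as `None` rather than dropping the row leaves them available to the imputation stage described earlier in the paper.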
4.1.3 Heart disease (HD)
Only a subset of 14 of the 76 attributes in the heart disease (HD)
data is taken into account. The Cleveland database, in particular,
is the one used most by MLT. The goal field takes an integer value
from 0 (absence) to 4 and indicates the presence of heart disease.
The tests performed on this database aim to distinguish the presence
of disease from its absence.
4.1.3.1 Hungarian dataset It contains 294 samples with
14 features.
4.1.3.2 Switzerland dataset It contains 123 samples with
14 features.
4.2 Evaluation metrics
After the missing values in the datasets have been identified and
the datasets completed, the following metrics are calculated:
precision, recall, specificity, F-measure, and accuracy.
Precision gives the proportion of positive predictions that are
actually correct. It is calculated by Eq. (22).
Recall measures the proportion of actual positive cases that are
correctly predicted. It is calculated by Eq. (23).
The F-measure, which combines precision and recall, is given by
Eq. (24).
Specificity, often referred to as the true negative rate, measures
the proportion of actual negatives that are correctly identified.
It is given by Eq. (25).
Accuracy is one of the most widely used metrics for examining
classification performance, and it is used in this study for the
purpose of cancer detection. It is expressed by Eq. (26).
The quality of the imputed missing values is measured using the
normalized root mean square error (NRMSE), calculated by Eq. (27),
in which $x_i$ denotes the true value and $x'_i$ the simulated
value.
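Eqs. (22)–(27) translate directly into code, as in the sketch below. The function names and the example counts are hypothetical. The normalization range (max − min) is taken over the true values, which is an assumption because the text does not say which values define it.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute Eqs. (22)-(26) from confusion-matrix counts."""
    precision = tp / (fp + tp)                                   # Eq. (22)
    recall = tp / (tp + fn)                                      # Eq. (23)
    f_measure = 2 * (recall * precision) / (recall + precision)  # Eq. (24)
    specificity = tn / (fp + tn)                                 # Eq. (25)
    accuracy = (tp + tn) / (tp + fp + tn + fn)                   # Eq. (26)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "specificity": specificity,
        "accuracy": accuracy,
    }


def nrmse(true_values, simulated_values):
    """Eq. (27): RMSE divided by the range of the true values (assumed)."""
    n = len(true_values)
    mse = sum((x - x_sim) ** 2
              for x, x_sim in zip(true_values, simulated_values)) / n
    return math.sqrt(mse) / (max(true_values) - min(true_values))


# Hypothetical counts: 8 TP, 2 FP, 9 TN, 1 FN.
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
error = nrmse([0.0, 10.0], [1.0, 9.0])
print(metrics, error)
```

The sketch does not guard against empty classes (for example tp + fp = 0), which would need handling in practice.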
$$\mathrm{Precision} = \frac{TP}{FP + TP} \tag{22}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{23}$$

$$F\text{-}\mathrm{measure} = \frac{2 \cdot (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} \tag{24}$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN} \tag{25}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{26}$$

$$\mathrm{NRMSE} = \frac{1}{\max - \min} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_i - x'_i\right)^2} \tag{27}$$

4.3 Results comparison
In the experiments, the proposed EDL classifier was compared with
four other classifiers: KNN, DT, ANFIS, and CNN. All of these
classifiers are applied after feature selection via AMCGWO is
completed. Table 1 compares the results of feature selection with
the classifiers across the datasets.
Figure 4 compares the precision of the proposed approach with that
of earlier classifiers on the BC datasets (Fig. 4a) and the HD
datasets (Fig. 4b). For WDBC and WOBC, the proposed MFWCP-EDL
approach yields precision values of 98.7304% and 98.1207%,
respectively. For the Hungarian and Switzerland datasets, it yields
precision values of 98.5446% and 91.6667%, respectively, which is
the assignment consistent with the recall and F-measure values
reported for these datasets. The other techniques, WCP-KNN,
MFWCP-KNN, WCP-DT, MFWCP-DT, WCP-ANFIS, MFWCP-ANFIS, WCP-CNN, and
MFWCP-CNN, have precision values of 56.3333%, 60.9211%, 56.8750%,
71.2698%, 73.4848%, 76.3158%, 83.3333%, and 85.7143%, respectively,
for the Switzerland dataset (see Table 1). The proposed system
achieves higher precision because the AMCGWO algorithm selects an
optimal set of features, which allows it to predict the true
positive results accurately.
The recall of the various classifiers with imputation methods is
compared for the BC and HD datasets in Fig. 5a and b, respectively.
For WDBC and WOBC, the proposed MFWCP-EDL approach yields recall
values of 98.6364% and 98.3235%, respectively. For the Switzerland
and Hungarian datasets, it yields recall values of 99.1150% and
98.9132%, respectively. The other techniques, WCP-KNN, MFWCP-KNN,
WCP-DT, MFWCP-DT, WCP-ANFIS, MFWCP-ANFIS, WCP-CNN, and MFWCP-CNN,
show recall values of 60.7843%, 75.7080%, 68.6275%, 85.5752%,
80.3922%, 96.0177%, 97.7876%, and 98.2301%, respectively, for the
Switzerland dataset. The proposed system achieves higher
sensitivity because of the optimal feature selection by the AMCGWO
algorithm, which allows it to predict the actual positive cases
correctly.
Figure 6 compares the F-measure of the classification methods on
the BC datasets (Fig. 6a) and the HD datasets (Fig. 6b). For WDBC
and WOBC, the proposed MFWCP-EDL technique yields F-measure values
of 98.6834% and 98.2220%, respectively. For the Hungarian and
Switzerland datasets, it yields F-measure values of 98.7286% and
95.2455%, respectively. The other techniques, WCP-KNN, MFWCP-KNN,
WCP-DT, MFWCP-DT, WCP-ANFIS, MFWCP-ANFIS, WCP-CNN, and MFWCP-CNN,
show F-measure values of 58.7647%, 67.5143%, 61.8756%, 77.7702%,
76.7835%, 85.0405%, 89.9837%, and 91.5464%, respectively, for the
Switzerland dataset. The proposed method predicts the actual data
accurately, raising the F-measure while reducing the total number
of features in the dataset.
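As a quick consistency check, Eq. (24) can be applied to the precision and recall values reported for WDBC and WOBC and compared with the reported F-measures. The sketch below uses only figures quoted in this section; the function and variable names are hypothetical.

```python
def f_measure(precision, recall):
    """Eq. (24): harmonic mean of precision and recall."""
    return 2 * (recall * precision) / (recall + precision)


# (precision %, recall %, reported F-measure %) for MFWCP-EDL.
reported = {
    "WDBC": (98.7304, 98.6364, 98.6834),
    "WOBC": (98.1207, 98.3235, 98.2220),
}

computed = {name: f_measure(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f_reported) in reported.items():
    print(f"{name}: computed {computed[name]:.4f}, reported {f_reported:.4f}")
```

For both datasets the computed value agrees with the reported F-measure to within rounding.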
Figure 7 illustrates the specificity of the various classifiers on
the BC datasets. For WDBC and WOBC, the proposed MFWCP-EDL
technique yields specificity values of 87.6768% and 87.3987%,
respectively. For the Switzerland and Hungarian datasets, it yields
specificity values of 88.1023% and 87.9229%, respectively. The
alternative techniques, WCP-KNN, MFWCP-KNN, WCP-DT, MFWCP-DT,
WCP-ANFIS, MFWCP-ANFIS, WCP-CNN, and MFWCP-CNN, show specificity
values of 54.0305%, 67.2960%, 61.0022%, 76.0669%, 71.4597%,
85.3491%, 86.9223%, and 87.3156%, respectively, for the Switzerland
dataset (see Table 1).
Figure 8 shows the overall accuracy comparison of the classifiers
with imputation on the BC datasets. For WDBC and WOBC, the proposed
MFWCP-EDL technique demonstrates superior accuracy, with 98.7698%
and 98.2500%, respectively. For the Hungarian and Switzerland
datasets, it yields accuracy values of 98.9796% and 98.3740%,
respectively; the Switzerland figure matches the 98.374% quoted in
the conclusion. The other techniques, WCP-KNN, MFWCP-KNN, WCP-DT,
MFWCP-DT, WCP-ANFIS, MFWCP-ANFIS, WCP-CNN, and MFWCP-CNN, give
accuracy values of 70.2703%, 80.4878%, 83.7838%, 90.2439%,
91.8919%, 92.6829%, 95.9350%, and 96.7480%, respectively, for the
Switzerland dataset. The proposed system increases accuracy by
correctly classifying the samples as positive or negative.
Figure 9 depicts the NRMSE evaluation of the classification methods
on the BC and HD datasets. The methods WCP-KNN, MFWCP-KNN, WCP-DT,
MFWCP-DT, WCP-ANFIS, MFWCP-ANFIS, WCP-CNN, and MFWCP-CNN give
NRMSE values of 0.5452, 0.4417, 0.4027, 0.3123, 0.2847, 0.2705,
0.2016, and 0.1803, respectively, for the Switzerland dataset.
Fig. 4 Precision Comparison VS Classifiers
Fig. 5 Recall Value Comparison Vs Classifiers
5 Conclusion and future work
In many classification problems the feature training matrix has
missing entries, and imputation of the missing data is a common
remedy. At the same time, feature selection becomes more
significant, particularly for datasets with a large number of
elements and variables. This work addresses missing-data
imputation, feature selection, and classification for the diagnosis
of multiple diseases. Initially, missing values are imputed by a
Bayesian network (BN), and the imputed data are then reconstructed
by tensor factorization via MFWCP. AMCGWO is a wrapper-based
strategy for selecting only the optimal features. It uses the
average value of the features and introduces the ICMIC map to
improve GWO's performance. EDL improves the efficiency of
disease-diagnosis algorithms. EDL is built on ECNN and DWBi-LSTM,
combined through bootstrap aggregation. In the ECNN classifier, the
weights and biases are calculated via entropy with feature-based
importance. In the DWBi-LSTM classifier, feature weights play a
vital role in classification; the importance of each feature with
respect to the target concept is weighted by Kullback–Leibler (KL)
divergence. The majority voting classifier combines the classifiers
by majority vote, so the final class label is the one predicted by
most of the classifiers. The classifiers' outcomes are assessed for
disease prediction. Compared with MFWCP-KNN, MFWCP-DT, MFWCP-ANFIS,
and MFWCP-CNN on the Switzerland dataset, the proposed MFWCP-EDL
technique achieves a higher accuracy of 98.374%, which is 17.8862,
8.1301, 5.6911, and 1.626 percentage points higher than those
methods, respectively. In future work, this study can be extended
by integrating more datasets and by incorporating new deep learning
algorithms to improve the effectiveness of the classifier.
Fig. 6 F-Measure Value Comparison Vs Classifiers
Fig. 7 Specificity Results Comparison vs. Classifiers
Declarations
Conflict of interest The authors declare they have no conflicts of
interest.
Research involving Human Participants and/or Animals No animals or
humans were involved in this work.
Informed consent Not applicable as no human or animal sample
was involved in this study.
References
Ang JC, Mirzal A, Haron H, Hamed HNA (2015) Supervised, unsu-
pervised, and semi-supervised feature selection: a review on gene
selection. IEEE/ACM Trans Comput Biol Bioinf 13(5):971–989
Atallah R, Al-Mousa A (2019) Heart disease detection using machine
learning majority voting ensemble method. In: Proceedings of the
2019 2nd International Conference on New Trends in Computing
Sciences (ICTCS), pp. 1–6, Amman, Jordan, October 2019.
Baccouche A, Garcia-Zapirain B, Castillo Olea C, Elmaghraby A
(2020) Ensemble deep learning models for heart disease classifi-
cation: a case study from Mexico. Information 11(4):1–28
Bashir S, Qamar U, Khan FH (2015) BagMOOV: A novel ensemble for
heart disease prediction bootstrap aggregation with multi-objec-
tive optimized voting. Australas Phys Eng Sci Med 38(2):305–323
De S, Chakraborty B (2020) Disease Detection System (DDS) Using
Machine Learning Technique. In Machine Learning with Health
Care Perspective (pp. 107–132). Springer, Cham.
Elgin Christo VR, Khanna Nehemiah H, Minu B, Kannan A (2019)
Correlation-based ensemble feature selection using bioinspired
algorithms and classification using backpropagation neural net-
work. Comput Math Methods Med 2019(7398307):1–17
Faris H, Aljarah I, Al-Betar MA, Mirjalili S (2018) Grey wolf opti-
mizer: a review of recent variants and applications. Neural Com-
put Appl 30(2):413–435
Franzin A, Sambo F, Di Camillo B (2017) BNSTRUCT: an R package
for Bayesian network structure learning in the presence of missing
data. Bioinformatics 33(8):1250–1252
Gandomi AH, Yang X-S (2014) Chaotic bat algorithm. J Comput Sci
5(2):224–232
Kadam VJ, Jadhav SM, Vijayakumar K (2019) Breast cancer diagnosis
using feature ensemble learning based on stacked sparse autoen-
coders and softmax regression. J Med Syst 43(8):1–11
Latha CBC, Jeeva SC (2019) Improving the accuracy of prediction of
heart disease risk based on ensemble classification techniques.
Inform Med Unlocked 16:1–9
Liu K, Kang G, Zhang N, Hou B (2018) Breast cancer classifica-
tion based on fully-connected layer first Convolutional neural
networks. IEEE Access 6:23722–23732
Fig. 8 Accuracy Results Comparison Vs Classifiers
Fig. 9 NRMSE Value Comparison Vs Classifiers
Malvia S, Bagadi SA, Dubey US, Saxena S (2017) Epidemiol-
ogy of breast cancer in Indian women. Asia Pac J Clin Oncol
13(4):289–295
Masud M, Rashed AEE, Hossain MS (2020) Convolutional neural
network-based models for diagnosis of breast cancer. Neural
Comput Appl, pp.1–12.
Raza K (2019) Improving the prediction accuracy of heart disease
with ensemble learning and majority voting rule. In: U-Healthcare
Monitoring Systems, pp. 179–196
Ren Y, Zhao P, Sheng Y, Yao D, Xu Z (2017) Robust softmax regression
for multi-class classification with self-paced learning.
In: Proceedings of the 26th International Joint Conference on
Artificial Intelligence (pp. 2641–2647).
Rokach L, Schclar A, Itach E (2014) Ensemble methods for multi-
label classification. Exp Syst Appl 41(16):7507–7523
Sahoo AK, Pradhan C, Das H (2020) Performance evaluation of
different machine learning methods and deep-learning based
Convolutional neural network for health decision making.
In Nature inspired computing for data science (pp. 201–212).
Springer, Cham.
Sainath TN, Mohamed AR, Kingsbury B, Ramabhadran B (2013)
Deep Convolutional neural networks for LVCSR. In: Proceed-
ings of the 38th IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP ’13), pp. 8614–8618,
2013.
Sapra L, Sandhu JK, Goyal N (2021) Intelligent method for detection
of coronary artery disease with ensemble approach. Advances in
Communication and Computational Technology, vol. 1033–1042,
2021.
Sheikhpour R, Sarram MA, Gharaghani S, Chahooki MAZ (2017)
A survey on semi-supervised feature selection methods. Pattern
Recogn 64:141–158
Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF (2020)
A review of unsupervised feature selection methods. Artif Intell
Rev 53(2):907–948
Song L, Smola A, Gretton A, Borgwardt KM, Bedo J (2017) Supervised
feature selection via dependence estimation. In: Proceedings of the
24th International Conference on Machine Learning, pp. 823–830,
ACM, Corvallis, OR, USA, June 2017.
Van Houdt G, Mosquera C, Nápoles G (2020) A review on the long
short-term memory model. Artif Intell Rev 53(8):5929–5955
Vazifehdan M, Moattar MH, Jalali M (2019) A hybrid Bayesian net-
work and tensor factorization approach for missing value impu-
tation to improve BC recurrence prediction. J King Saud Univ-
Comput Inform Sci 31(2):175–184
Wang J, Wen G, Yang S, Liu Y (2018) Remaining useful life estimation
in prognostics using deep bidirectional LSTM neural network.
In: 2018 Prognostics and System Health Management Conference
(PHM-Chongqing), pp. 1037–1042.
Xue B, Zhang M, Browne WN, Yao X (2015) A survey on evolutionary
computation approaches to feature selection. IEEE Trans Evol
Comput 20(4):606–626
Xu J, Tang B, He H, Man H (2016) Semisupervised feature selection
based on relevance and redundancy criteria. IEEE Trans Neural
Netw Learn Syst 28(9):1974–1984
Yang F, Shang F, Huang Y, Cheng J, Li J, Zhao Y, Zhao R (2017)
LFTF: a framework for efficient tensor analytics at scale. Proceed
VLDB Endowment 10(7):745–756
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds
exclusive rights to this article under a publishing agreement with the
author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of
such publishing agreement and applicable law.