WAVES IN RANDOM AND COMPLEX MEDIA
https://doi.org/10.1080/17455030.2023.2189485
Deep convolutional neural networks accurately predict breast
cancer using mammograms
Lal Hussaina,b, Sara Ansaric, Mamoona Shabird, Shahzad Ahmad Qureshie,
Amjad Aldweeshf, Abdulfattah Omarg, Zahoor Iqbalh and Syed Ahmed Chan Bukharii
aDepartment of Computer Science & IT, Neelum Campus, The University of Azad Jammu and Kashmir,
Muzaffarabad, Pakistan; bDepartment of Computer Science & IT, King Abdullah Campus, The University of
Azad Jammu and Kashmir, Muzaffarabad, Pakistan; cThe Children’s Hospital, University of Child Sciences,
Lahore, Pakistan; dServices Institute of Medical Sciences, Lahore, Pakistan; eDepartment of Computer and
Information Sciences, Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad, Pakistan;
fCollege of Computer Science and Information Technology, Shaqra University, Shaqra, Saudi Arabia;
gDepartment of English, College of Science & Humanities, Prince Sattam Bin Abdulaziz University, Al Kharj,
Saudi Arabia; hDepartment of Mathematics, Quaid-i-Azam University, Islamabad, Pakistan; iHealthcare
Informatics, St. John’s University, Queens, NY, USA
ABSTRACT
Breast cancer is the most frequently diagnosed cancer in women and a leading cause of cancer deaths. Due to the complex nature of micro-calcifications and masses, radiologists may fail to diagnose breast cancer properly. In this research paper, we employed a novel Deep Convolutional Neural Network (DCNN) model using a transfer learning strategy and compared the results with Machine Learning (ML) techniques such as Support Vector Machine (SVM) kernels and Decision Trees based on different feature-extracting strategies, to distinguish cancer mammograms from those of normal subjects. We first extracted hand-crafted features such as texture, morphological, entropy-based, scale-invariant feature transform (SIFT), and elliptic Fourier descriptor (EFDs) features and fed them into machine learning algorithms for classification. We then utilized deep learning algorithms with a transfer learning approach. The deep learning models yielded the highest detection performance with default and optimized parameters, i.e. GoogleNet yielded accuracy (99.26%) and AUC (0.9998) with default parameters, and AlexNet yielded accuracy (99.26%) and AUC (0.9996) with optimized parameters. The results reveal that the proposed approach is more robust for early detection of breast cancer in mammograms, which can be best utilized for improved diagnosis and prognosis.
ARTICLE HISTORY
Received 29 November 2021
Accepted 20 February 2023
KEYWORDS
Breast cancer; deep learning (DL); convolutional neural network (CNN); GoogleNet; AlexNet; support vector machine (SVM); scale-invariant feature transform (SIFT)

CONTACT Lal Hussain lall_hussain2008@live.com; Amjad Aldweesh a.aldweesh@su.edu.sa
© 2023 Informa UK Limited, trading as Taylor & Francis Group
1. Introduction
Breast cancer is among the most frequently diagnosed cancers in women. In developing countries, breast cancer accounts for 23% of all cancer cases, and 1.6 million new cases of breast cancer are estimated worldwide among women [1–3]. Breast cancer accounts for nearly one in three cancers among US women, excluding skin cancer, and is the second leading cause of cancer death among women after lung cancer [4]. In 2016, breast cancer accounted for about 29% of cancer deaths among females in the United States. In the same year, it was estimated that 595,690 Americans would die from cancer, corresponding to about 1600 deaths per day [5]. The most common causes of cancer deaths are lung and bronchus, prostate, and colorectal cancers in men; for women, these are lung and bronchus, breast, and colorectal cancers. The lifetime probability of being diagnosed with invasive cancer is higher in men (42%) than in women (38%). This may reflect differences in environmental exposure, endogenous hormones, and the complex interaction between these influences. Cancer incidence and death in both men and women are associated with adult height, determined by genetics and childhood nutrition, which accounts for roughly one-third of the difference in cancer risk [5,6]. The cancer risk for adults younger than 50 years is higher in women (5.4%) than in men (3.4%) because of the relatively high burden of breast, genital, and thyroid cancers in young women [7].
The early diagnosis and detection of breast cancer can decrease the death rate and
provide means for prompt treatment. Breast cancer is diagnosed and detected using a com-
bination of approaches, including imaging, physical examination, and biopsy [8]. One of
the imaging techniques used to detect breast cancer is mammography, where X-rays are
used to create images, known as mammograms, of the breast. Radiologists are trained to
read mammograms to detect the signs of breast cancer. The effectiveness of the screening process can depend on the radiologists' interpretations [9]. Patients with palpable breast cancer may have sonogram and mammogram examinations with a normal, benign, or nonspecific appearance [10]. A biopsy is used to confirm the symptoms of breast cancer, but it is an invasive surgical operation with a psychological and physical impact on patients. To avoid unnecessary biopsies, researchers have devised and investigated various computer-aided diagnosis (CAD) systems [3,11] that provide stable detection rates by identifying ultrasound and clinical features [12], using data mining classification techniques, medical imaging and computer-aided diagnostics [13], and breast magnetic resonance imaging (MRI) [14].
As far as mammography is concerned, research evidence shows that radiologists may miss up to 30% of breast cancers, depending on the density of the breasts [15]. Mammograms in breast cancer have been evaluated using two powerful indicators: masses and micro-calcifications. Mass detection is more challenging than micro-calcification detection, not only due to the large variation in size and shape with which masses can appear in mammograms but also because masses often exhibit poor image contrast [16]. Radiologists read mammograms based on their experience, training, and subjective criteria. There may be a 65–75% inter-observer variation rate even among trained experts [17]. Hence, computer-aided diagnosis (CAD) may help radiologists to interpret mammograms in order to detect and classify masses. The literature also reveals that about 65–90% of biopsies of suspected cancers turn out to be benign. Thus, it is essential to develop techniques that can distinguish malignant from benign lesions. The combination of computer-aided diagnosis (CAD), expert knowledge, and Machine Learning (ML) techniques can greatly improve detection accuracy: detection accuracy without CAD was below 80%, and with CAD above 90% [18]. CAD can automatically identify areas of abnormal contrast, directing the radiologist towards suspicious regions. Thus, mammograms with CAD will improve the detection of cancer. In many cases, cancer masses and micro-calcifications are hidden in dense breast tissue, especially in younger women, making cancer complex to detect and diagnose [3].
Feature extraction is an important step in detecting pathologies from physiological and neurophysiological systems. For example, time–frequency representation methods were employed by [19] to determine the correlation and coupling between brain waves during resting states. Hussain et al. [20] extracted multimodal features based on fuzzy entropy to detect arrhythmia, which outperformed traditional feature-extracting approaches, and hybrid features [21] with regression methods to detect and predict epileptic seizures. Moreover, to distinguish normal images from malignant subjects, researchers have extracted different imaging-related features. Karahaliou et al. [22] used a probabilistic neural network to diagnose breast cancer by extracting multi-scale texture properties of the tissue surrounding the micro-calcifications. In the past few decades, other approaches have also been used to detect and diagnose breast cancer, viz. a probabilistic algorithm and a radial gradient index-based algorithm [23], a Convolutional Neural Network (CNN) classifier [24], a mixed feature-based neural network [25], fractal geometry and analysis using digital mammograms [26–28], and a method for automated segmentation of individual micro-calcifications in a region of interest (ROI). Recently, Hussain et al. [29] computed the associations between morphological features extracted from prostate cancer images and found very strong associations among the features.
In the past, researchers employed different hand-crafted feature-extracting strategies such as texture, morphology, gray-level co-occurrence matrix, histogram of oriented gradients, scale-invariant feature transform, or a hybrid of these features for brain tumor, prostate cancer, and arrhythmia detection using ML and DL techniques [20,30,31]. The existing techniques have some limitations: graph-based techniques are computationally expensive, and other computer-aided diagnosis (CAD) techniques based on texture features exploit general texture features for classification and fail to provide the background knowledge of morphological features. Machine learning methods based on different feature-extracting strategies are also limited, as different researchers employ different feature-extracting methods. Moreover, these classifiers are not fine-tuned for the challenging contrast present in the features.
With the advent of modern computational systems, ML-related Artificial Intelligence applications and graphical processing unit (GPU)-embedded processors have achieved exponential growth through the development of novel models and methodologies, a field currently known as DL [32]. The DL-based Convolutional Neural Network (CNN) model adopts the architecture of an artificial neural network containing a much larger number of processing layers, contrary to shallower architectures. CNNs drastically reduce the structural elements (i.e. neurons) in comparison to traditional feedforward neural networks [32]. For image processing, different baseline CNN architectures have been developed and successfully applied to complicated image-processing tasks.
Breast cancer diagnosis has seen classification and segmentation performance improvements thanks to representation learning, a characteristic of DL, and its automatic feature-extraction proficiency, as compared with the handpicked feature extraction required in ML [33]. The learning phase is characterized by the flow of information, exhibiting the capability of self-learning [34]. In DL, the Bayesian framework determines uncertainty in the model output using a Bayesian neural network [35,36]. Donald F. Specht introduced the probabilistic neural network (PNN), based on Bayesian classification theory and consisting of three layers, viz. input, radial basis, and competitive layers [37,38]. PNNs have been used to categorize mammography images into normal, benign, and malignant classes, with the discrete wavelet transform used to form the handpicked input feature vector. That study used seventy-five mammograms and claimed an accuracy of 90%.
Zhang, Lin, et al. [39] introduced a three-stage neural network method to reduce the false-positive rate of microcalcification detection in mammographic images. Microcalcifications were detected in the first stage; in the second stage, the false-positive detections from the first stage's output were reduced; and lastly, in the third stage, a Kalman filter-based backpropagation neural network isolated the microcalcifications in the mammograms.
DL networks using CNNs have achieved outstanding performance for the detection and classification of masses and microcalcifications. In this context, Fukushima et al. introduced an early CNN for pattern recognition, the Neocognitron, later applied to medical image analysis [40,41]. Lo et al. [42] introduced a CNN with multiple circular paths in which information was first collected from the suspected regions of mammograms and then processed as features by the CNN. Sahiner et al. [43] proposed a CNN for mammography in which selected regions, extracted by either averaging or subsampling, were input to the CNN.
Jiao et al. [44] classified breast masses using a DL-based strategy in which intensity-based features were combined with CNN-extracted features from mammograms. Fonseca et al. [45] used a CNN with an SVM classifier for the classification of breast cancer. Su et al. [46] introduced a rapid CNN method for breast cancer categorization in which semantic segmentation was carried out to reduce redundant information, at the cost of a more complex CNN model. Huynh et al. [47] used a CNN with transfer learning to classify masses and microcalcifications. Arevalo et al. [48] introduced a method that did not use hand-crafted features, in which a CNN learned the data representation in a supervised manner from biopsy images of 344 breast cancer patients.
Rezaeilouyeh et al. [49] proposed a microscopic breast cancer classification model using a CNN in which shearlet transform-based images were obtained as feature vectors; the shearlet coefficients were then input to the CNN for classification. Jaffar [50] proposed a method based on enhancement as a preprocessing step for mammograms, followed by a CNN for feature extraction; the features were used to train an SVM classifier. Jadoon et al. [51] introduced a dual deep neural network-based classification model for three classes, viz. benign, malignant, and normal. The two algorithms were a convolutional neural network-discrete wavelet model and a convolutional neural network-curvelet transform model. The features extracted from the discrete wavelet and curvelet transform-based coefficients were fused and fed to the CNN, which was trained with softmax and SVM for classification. Gastounioti et al. [52] used an ensemble classifier for breast cancer categorization, in which textural feature maps obtained from lattice-based methods were fed to a CNN for multi-class categorization. Wang et al. [53] proposed a hybrid approach for breast cancer classification into benign and malignant classes: cropping and clinical features were extracted using multi-view patches of mammograms, and the CNN was trained using multiple features to focus on the regions related to semantic lesions. Zhu et al. [54] introduced a fully convolutional network combined with a conditional random field to segment the masses within mammograms. The method estimated ROIs on an empirical basis with prior information on positions, which helped to improve the prediction of ROIs.
Ribli et al. [55] introduced Faster Regions with Convolutional Neural Networks (Faster R-CNN) for breast cancer classification into benign and malignant cases. In Faster R-CNN, the ROI pooling method was used to extract the features that are fed to the VGG-16 model. The output of the method was a set of bounding boxes with confidence scores that decide the class of cancer. Chiao et al. [56] proposed an improved version of the region proposal network called Mask R-CNN, which was used for the detection and segmentation of cancer regions in mammograms. The Mask R-CNN method used the ROI alignment technique; after feature extraction by the ROI Align method, a CNN was used for detection and classification. Nahid et al. [57] used LSTM for the classification of microcalcifications and masses by transforming mammograms into a 1D-vector format, followed by conversion into time-series data. A total of 7909 images from the BreakHis histopathological dataset were evaluated using SVM and softmax at the decision layer.
In contrast, DL convolutional neural network models with TL approaches are fine-tuned to optimize the parameters by minimizing the error. In this study, we tested the generalization of breast cancer mammographic images through AlexNet [33] and GoogleNet [58] as pre-trained CNN models, using a TL approach verified in the literature [59,60] on the most widely used imaging datasets. The features and training data are assumed to lie within the same feature space. Transfer learning allows users to extract previously acquired expertise and apply it to a new domain, reducing overall computational time, with the images lying in the combined feature space of the two TL methods, offering a broader spectrum with marked discrimination in feature space. The widened solution space obtained through feature fusion has resulted in superior performance.
2. Methods
2.1. Datasets
Datasets were taken from the publicly available databases provided by the University of South Florida [61], available online at http://marathon.csee.usf.edu/Mammography/Database.html. In DDSM images, suspicious regions of interest are marked by experienced radiologists, and BI-RADS information is also annotated for each abnormal region. In our experiment, we used mass instance images digitized by LUMISYS. This dataset contains approximately 2500 studies. We used the latest volumes of the DDSM database, i.e. 12 normal volumes and 15 cancer volumes, containing a total of 899 images, including 500 cancer images from 105 cases and 399 normal subject images from 100 cases.
2.2. Convolutional neural network
Owing to their strong performance, CNNs have been used for breast cancer classification [62]. An end-to-end CNN architecture was applied to classify the cancer images directly. To obtain high performance, a careful combination of pre-processing, TL, and data augmentation is required. In the proposed work, the performance was evaluated using two CNN architectures, namely AlexNet [33] and GoogleNet [58]. For both networks, the original architecture was used, only replacing the last fully connected (FC) layer to output two classes; from GoogleNet, the two auxiliary classifiers were removed. We also used batch normalization to regularize the data flowing between neural network layers, reducing the internal covariate shift [63]. Input images of 224 x 224 x 3 were supplied to the network. The CNN consists of convolution blocks composed of 3 x 3 convolutions (Batch Norm-ReLU-Max Pooling) with 32, 32, and 64 filters respectively, followed by three fully connected layers of size 128, 64, and 2. The final layer is a softmax for binary classification. In this study, we used default and optimized parameters: Xavier's [64] weight initialization, the ReLU activation function, and Adam's [62] update rule. We used a base learning rate of 10^-4 and mini-batch sizes of 20 and 64, while for the optimized parameters we used a momentum of 0.9, an initial learning rate of 0.001, a learning rate drop factor of 0.1, L2 regularization of 0.004, a batch size of 20, and 2 epochs.
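For concreteness, the block structure just described can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the authors' code (the study relied on MATLAB pre-trained models); the layer dimensions follow the text: three 3 x 3 Conv-BatchNorm-ReLU-MaxPool blocks with 32, 32, and 64 filters, then FC layers of 128, 64, and 2.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Baseline CNN sketched in the text: three 3x3 Conv-BN-ReLU-MaxPool
    blocks (32, 32, 64 filters) followed by FC layers of 128, 64, 2."""
    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        layers, prev = [], in_channels
        for filters in (32, 32, 64):
            layers += [
                nn.Conv2d(prev, filters, kernel_size=3, padding=1),
                nn.BatchNorm2d(filters),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
            prev = filters
        self.features = nn.Sequential(*layers)
        # A 224x224 input is halved three times -> 28x28 spatial maps.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 28 * 28, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 64), nn.ReLU(inplace=True),
            nn.Linear(64, num_classes),  # softmax is applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
out = model(torch.randn(1, 3, 224, 224))  # logits for the 2 classes
```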
Let us consider an output y (for example, the object depicted in an image) produced by a model y = f(x, θ). Since the model is not known in advance, our aim is to use a generic model described by a set of parameters θ that is specialized to the target task. This can be done with a supervised ML approach by presenting the model with a set of input-label pairs (x, y) and iteratively updating its parameters so that the obtained output is close to the associated labels. To measure the difference between the label ŷ predicted by the model and the desired label y, a loss function L(y, ŷ) is employed. The main purpose of the learning process is to select the parameter values θ that minimize this function. An optimization method from the family of gradient descent algorithms is used to adjust the parameter values θ.
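A minimal sketch of this supervised loop, assuming a placeholder model, random stand-in data, and Adam at the base learning rate of 10^-4 named earlier:

```python
import torch
import torch.nn as nn

# Placeholder model and data; substitute the CNN and mammogram
# batches described in the text.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))
x = torch.randn(20, 3, 224, 224)   # mini-batch of 20 images
y = torch.randint(0, 2, (20,))     # binary labels (cancer / normal)

loss_fn = nn.CrossEntropyLoss()    # softmax + negative log-likelihood
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(10):             # iterative parameter updates
    opt.zero_grad()
    y_hat = model(x)               # y_hat = f(x, theta)
    loss = loss_fn(y_hat, y)       # discrepancy L(y, y_hat)
    loss.backward()                # gradients of the loss w.r.t. theta
    opt.step()                     # gradient-descent-family update
```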
2.2.1. Deep learning ResNet101
ResNet101, named after the 101 layers of its residual network, is a modified version of the ResNet-50 architecture. The ResNet model was originally proposed by He et al. in 2016 [32]. ResNet is an abbreviation for residual networks; it has been employed in solving numerous problems related to computer vision and other applications. ResNet is one of the deepest convolutional neural network architectures used at large scale and has been applied to a wide range of tasks on the ImageNet dataset (i.e. object detection and recognition, and various classification purposes). Generally, the multiple layers of a CNN are interconnected in a specified manner, and these layers are trained to perform various tasks. The basic idea behind the ResNet architecture is the residual connection, across which the gradients can pass so that they are prevented from vanishing to zero when the chain rule is applied [32]. ResNet101 has 104 convolutional layers organized into 33 blocks, one block per group of layers. Nine of the 33 blocks use the output of previous layers directly, which is known as a residual connection; these residual connections serve as the first operand of the summation operator at the end of each block, carrying the input from earlier layers. The remaining 4 layers receive the output of the previous block as input and pass it through a convolutional layer with a filter size of 1 x 1 and a stride of 1, followed by a group of normalization layers. This normalization layer performs the normalization operation, and the obtained output is transferred to the summation operator at the output of that block. The depth of each block may vary according to its density [65]. The general architecture of ResNet101 is shown in Figure 1.
The hyper-parameter settings found empirically for ResNet101 are listed in Table 1. The hyper-parameters of the CNN models were adjusted heuristically to facilitate the convergence of the loss function during training. The Adam optimizer was chosen because of its parameter-specific, adaptive learning rates. The initial learning rate was chosen as 0.0001 for ResNet101: a large learning rate may prevent the loss function from converging and can cause overshoots, while an extremely small learning rate drastically increases the training time. Mini-batch sizes of 10 and 12 were set according to the speed of training and the computational requirements; extremely large batch sizes adversely affect model quality.

Figure 1. ResNet101 overall architecture.

Table 1. Empirically tuned set of parameters.
Model: ResNet101 (TL Deep CNN)
  Optimizer: Adam
  Momentum: 0.90
  Initial learning rate: 0.0001
  L2 regularization: 0.00004
  Max epochs: 10
  Mini-batch size: 12
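The Table 1 settings translate roughly to the PyTorch optimizer configuration below; note that Adam has no separate momentum term, so mapping the listed momentum of 0.90 onto its first-moment coefficient beta1 is an interpretation, not something stated in the text.

```python
import torch
from torchvision.models import resnet101

model = resnet101(weights=None)  # stand-in for the fine-tuned ResNet101

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,              # initial learning rate (Table 1)
    betas=(0.90, 0.999),  # beta1 = 0.90 playing the role of momentum
    weight_decay=4e-5,    # L2 regularization (Table 1)
)
max_epochs = 10           # Table 1
minibatch_size = 12       # Table 1
```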
2.2.2. GoogleNet
GoogleNet was retrained on a new set of cancer images. The weights of the earlier layers in the network were frozen by setting their learning rates to zero. While the training layers were frozen, their parameters were not updated because the gradients of these layers were not computed, which helped to improve the network performance significantly. This property also helps to avoid overfitting on the new dataset. The first 110 layers of GoogleNet include the inception modules. Using freezeWeights(), the learning rates of the first 110 layers were set to zero, and the layers were then reconnected in their original order using createLgraphUsingConnections() while the earlier layers' learning rates remained zero. Figure 2 illustrates the schematic diagram of the GoogleNet model.

Figure 2. Schematic diagram of GoogleNet architecture.
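freezeWeights() is a MATLAB helper; an equivalent effect in PyTorch is obtained by disabling gradients for the early parameters, as in the following sketch (the 110-parameter cut-off is illustrative only, since PyTorch and MATLAB do not count "layers" identically):

```python
import torch
from torchvision.models import googlenet

model = googlenet(weights="IMAGENET1K_V1")  # ImageNet pre-trained GoogLeNet

# Freeze the earlier parameters: their gradients are never computed,
# so they are never updated during retraining (cf. freezeWeights()).
for p in list(model.parameters())[:110]:    # illustrative cut-off
    p.requires_grad = False

# Only the remaining, unfrozen parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```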
2.2.2.1. Train network-framework. The networks require input images of size 224 x 224 x 3 and 227 x 227 x 3 for GoogleNet and AlexNet, respectively, but the images in the dataset have different sizes, so we used the imresize() function to resize the images to the required input size. The TL-based framework adopts ResNet-101 (2048 features) and GoogleNet (1000 features) using mammograms; after fusion, 3048 features were used for each image. The entire dataset was fed to the cross-validation (10-fold) stage, and the optimized model was used to determine the performance on the test instances for discriminating healthy from diseased subjects.
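A sketch of the described 2048 + 1000 = 3048 feature fusion, assuming torchvision's pre-trained ResNet-101 and GoogLeNet stand in for the authors' models and that ResNet's pooled features and GoogLeNet's final-layer outputs are the vectors being fused:

```python
import torch
from torchvision.models import resnet101, googlenet

resnet = resnet101(weights="IMAGENET1K_V1").eval()
gnet = googlenet(weights="IMAGENET1K_V1").eval()

# Strip ResNet-101's classifier so it emits its 2048-dim pooled features.
resnet.fc = torch.nn.Identity()

@torch.no_grad()
def fused_features(batch):             # batch: (N, 3, 224, 224)
    f1 = resnet(batch)                 # (N, 2048) ResNet-101 features
    f2 = gnet(batch)                   # (N, 1000) GoogLeNet outputs
    return torch.cat([f1, f2], dim=1)  # (N, 3048) fused vector per image

feats = fused_features(torch.randn(4, 3, 224, 224))
print(feats.shape)                     # torch.Size([4, 3048])
```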
2.2.2.2. Transfer learning (TL) approach. We applied the TL approach using GoogleNet and AlexNet, CNNs pre-trained on ImageNet and comprising inception, convolution, and fully connected layers. The fully connected layers require a fixed input image size for processing, while convolution layers can work with arbitrary input image sizes. To avoid overfitting during training, the images were resized to 224 x 224 x 3 for GoogleNet and 227 x 227 x 3 for AlexNet. Moreover, for GoogleNet, we modified the dimension of the last fully connected layer from 1000 to 2. This last fully connected layer was completely re-initialized at random, while all other layers retained their weights from pre-training. The shallow layers capture general, low-level image features, while deeper layers are high-level and task-specific; thus, the learning rate of deeper layers should be larger than that of shallow layers. The batch size was set to 20, with an initial learning rate of 10^-4 and a maximum of 6 epochs over 378 iterations.
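A hedged PyTorch sketch of this head replacement and resizing step (torchvision's GoogLeNet is assumed; the authors worked with MATLAB networks):

```python
import torch.nn as nn
from torchvision import transforms
from torchvision.models import googlenet

model = googlenet(weights="IMAGENET1K_V1")

# Replace the 1000-way ImageNet head with a freshly initialized
# 2-way layer (cancer vs. normal); all other layers keep their
# pre-trained weights.
model.fc = nn.Linear(model.fc.in_features, 2)

# Resize inputs to the size the network expects:
# 224 x 224 x 3 for GoogLeNet (227 x 227 x 3 for AlexNet).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # analogue of the imresize() step
    transforms.ToTensor(),
])
```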
Training a CNN entirely from scratch can be cumbersome because a small dataset may cause overfitting. To tackle this kind of problem, a TL technique is employed. This technique can solve a new problem using previously learned knowledge, extracting knowledge from source tasks and applying it to a target task through the concepts of a task T and a domain D.
Consider a domain D = {χ, P(X)} comprising a feature space χ and a marginal probability distribution P(X), where X = {x_1, x_2, . . . , x_n} ⊂ χ. For a domain D = {χ, P(X)}, a task T = {γ, f(·)} comprises a label space γ and an objective predictive function f(·) learned from the training data, which consists of pairs {x_i, y_i}, where x_i ∈ χ and y_i ∈ γ; f(·) predicts the corresponding label f(x) of a new instance x. Given a source domain Ds with corresponding source task Ts, and a target domain Dt with corresponding target task Tt, the TL approach aims to improve the learning of the target predictive function ft(·) in Dt using the knowledge in Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt [66].
Various approaches have been employed to apply TL to CNNs [67]. For a CNN previously trained on another task, say image classification on the ImageNet dataset [68], two approaches can be distinguished: (a) fine-tuning, in which the network parameters are retrained by back-propagating the error through the whole network [69]; and (b) freezing layers, in which most of the transferred features remain unchanged during training on the new task. The first layers contain the most generic features, common to many problems, while later layers progressively become more specific to the target dataset [70].
Applying the proper type of TL to a specific task requires taking several factors into consideration. The most important factors include the dataset size [71] and its similarity to the dataset used in the originally trained network [72], viz. ImageNet. When the dataset is smaller than the original dataset, the freezing-layer approach is most feasible because the low-level features remain relevant for the target dataset; moreover, a smaller dataset may lead to overfitting when the fine-tuning approach is employed, so fine-tuning is suggested when bigger data is available. The latter approach is also suitable when the available dataset differs substantially from the original one.
2.2.2.3. Convolutional layer. The convolutional layer is the main building block of a CNN. In the basic CNN, the convolution filter is a generalized linear model (GLM) for the underlying local image patch, and it works well when the instances of the latent concepts are linearly separable. This layer's learnable parameters are filters: 3D matrices of numerical values that are spatially smaller than the input in terms of dimension. By design choice, the width and height are fixed, while the depth is set by the number of input channels, i.e. the number of 2D inputs in the layer. During the forward pass, these filters slide across the height and width of the input. The filter sliding operation translates mathematically into a dot product between the filter and the input at each position. The 2D output is called an activation map, and it is stacked along the depth dimension with the other activation maps to form the output volume. The spatial size of the output is controlled by zero-padding techniques.
For convolutional layer l, the output of the i-th filter is denoted by y_i^l, with a total of C_{l-1} input feature maps, and is mathematically expressed as:

y_i^l = s( \sum_{j=1}^{C_{l-1}} f_{i,j}^l * y_j^{l-1} + b^l )   (1)

For layer l, the bias vector is denoted by b^l, the i-th filter of the convolution layer is denoted by f_{i,j}^l, which connects to the j-th feature map of layer l-1, and the activation function is represented by s.
A convolution operation is also employed during the backward pass, but the filters are flipped spatially along both the height and width axes. Using the backpropagation algorithm, the parameters f_{i,j}^l are updated and learned by the network. In this way, the network is capable of learning various types of filters, with their own specialized properties, to solve many kinds of tasks.
2.2.2.4. Pooling layer. The convolution layer is followed by the pooling layer, whose major function is to reduce the spatial size of the input and to operate independently on every depth slice. This layer is nonparametric and consists of filters that slide over the input with a fixed stride to produce the output [32,73]. It uses the filter functions max pooling or average pooling.
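The sliding dot product of Eq. (1) and the nonparametric pooling operation can both be written out directly in NumPy; the following is a didactic single-channel sketch, not an efficient implementation:

```python
import numpy as np

def conv2d(x, filt, bias=0.0):
    """Valid 2D convolution of one input map with one filter,
    i.e. the sliding dot product described in the text."""
    fh, fw = filt.shape
    oh, ow = x.shape[0] - fh + 1, x.shape[1] - fw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+fh, j:j+fw] * filt) + bias
    return out

def pool2d(x, size=2, mode="max"):
    """Nonparametric pooling: reduces spatial size, slice by slice."""
    oh, ow = x.shape[0] // size, x.shape[1] // size
    blocks = x[:oh*size, :ow*size].reshape(oh, size, ow, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.random.rand(6, 6)
fmap = np.maximum(conv2d(x, np.ones((3, 3)) / 9.0), 0)  # conv + ReLU
print(pool2d(fmap, 2, "max").shape)                      # (2, 2)
```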
2.2.2.5. Fully connected layer. To convert the combined features into class scores, at least one fully connected (FC) layer is present in a CNN before the network output. In this layer, each neuron is connected to all neurons in the preceding layer, following a mesh topology. The main function of this layer is to learn parameters (biases and weights) that map the input layer to the corresponding output layer.
The output y^l of FC layer l is computed as:

y^l = s( y^{l-1} * W^l + b^l )   (2)

where W^l and b^l denote the weight and bias vectors of layer l, and s represents the activation function. FC layers, contrary to convolution layers, do not support parameter sharing; due to this property, the number of learnable parameters in a CNN increases substantially.
2.2.2.6. Activation function. The activation function provides the nonlinearity that allows the network to learn more complex functions. In the DL framework, the nonlinear transformation from input to output is performed by the activation functions of the nonlinear layers in combination with the other layers [74,75]. An appropriate activation function is therefore required for a better feature-extracting strategy [33,76,77].
A brief overview of the most commonly used activation functions g(·) follows.

The sigmoid function is given by g(a) = 1 / (1 + e^{-a}), where a denotes the input from the preceding layer. The sigmoid transforms values into the range 0 to 1 and is commonly used to produce a Bernoulli distribution:

g̃ = 0 if g(a) ≤ 0.5, and 1 if g(a) > 0.5   (3)

The hyperbolic tangent function is given by g(a) = tanh(a) = (e^a - e^{-a}) / (e^a + e^{-a}). Its derivative, g' = 1 - g^2, makes it convenient to work with in backpropagation algorithms.

The softmax function is given by g(a_i) = e^{a_i} / \sum_j e^{a_j}. This is commonly used as the final output layer, as it can be considered a probability distribution over the categories.

The Rectified Linear Unit (ReLU) is the most widely used activation function, given by g(a) = max(0, a). With gradient-based algorithms, ReLU retains the easy-to-optimize property of linear models; it is easy to implement and greatly accelerates the convergence of optimization methods [32,73]. Superior performance has been shown using this activation function and its variants, and it is the most popular activation function in DL so far [77–80]. The gradient diffusion problem can also be mitigated using the ReLU function [74,81,82].

The softplus function, a variant of ReLU, is given by g(a) = log(1 + e^a); it is a smooth approximation of ReLU.

The absolute value rectification function, g(a) = |a|, is used when the pooling layer takes average values in CNNs [81], since it prevents negative and positive features from cancelling out.

The Maxout function is given by g_i(x) = max_i (b_i + w_i · x). In this case, a three-dimensional array is used for the weight matrix, with the connections between neighboring layers corresponding to the third dimension [75].
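These activation functions are straightforward to implement; the following NumPy sketch uses numerically stable forms for softmax and softplus:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):                       # (e^a - e^-a) / (e^a + e^-a)
    return np.tanh(a)

def softmax(a):
    e = np.exp(a - np.max(a))      # shift for numerical stability
    return e / e.sum()

def relu(a):
    return np.maximum(0.0, a)

def softplus(a):                   # smooth approximation of ReLU
    # Stable form of log(1 + e^a): max(a, 0) + log1p(exp(-|a|))
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), relu(a), softmax(a), softplus(a))
```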
2.2.2.7. Optimization objective. The objective function is composed of a loss function and a regularization term. The loss function measures the discrepancy between the network output f(x|θ) and the expected result y: in classification tasks, y denotes the true class labels, and in prediction tasks, the true target value. Regularization is the strategy of reducing the test error, so that the learning algorithm performs well not only on training data but also on unseen test data [74,75]. To prevent overly complex models, regularization applies penalties to the parameters. Denoting the loss function by L(f(θ), y) and the regularization term by Ω(θ), the optimization objective is defined as:

\tilde{L}(X, y, θ) = L(f(θ), y) + α Ω(θ)   (4)

where α balances the two components. Pragmatically, the loss function is usually computed across randomly sampled training examples rather than the data-generating distribution, because the latter is unknown.
2.2.2.8. Loss function. Most networks use the cross entropy between the model distribution and the training data as the loss function. The commonly used cross entropy is the negative conditional log-likelihood, L(f(θ), y) = -log P(y|x, θ), which represents the family of loss functions corresponding to the distribution of y given the value of the input variable x. Consider the following commonly used loss functions. Suppose y is continuous and has a Gaussian distribution given x. The loss function is:

L(f(θ), y) = -log [ (1 / \sqrt{2πσ^2}) exp( -(y - f)^2 / (2σ^2) ) ]   (5)

= (y - f)^2 / (2σ^2) + (1/2) log(2πσ^2)   (6)

This is equivalent to the squared error, which was the most commonly used loss function in the 1980s [74,75]. However, it penalizes outliers excessively, leading to slower convergence rates [83]. If the output variable y follows a Bernoulli distribution, the loss function is:

L(f(θ), y) = -y log f(θ) - (1 - y) log(1 - f(θ))   (7)

When y is discrete and takes one of k values, e.g. y ∈ {1, 2, 3, . . . , k}, we can use the softmax value as the probability over the categories, and the loss function becomes:

L(f(θ), y) = -log( e^{a_y} / \sum_j e^{a_j} )   (8)

= -a_y + log( \sum_j e^{a_j} )   (9)
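A NumPy sketch of the losses in Eqs. (7)-(9); the softmax cross entropy exploits the shift-invariance of Eq. (9) for numerical stability:

```python
import numpy as np

def cross_entropy(logits, y):
    """Negative log-likelihood of the true class y under the softmax
    distribution, i.e. Eqs. (8)-(9): -a_y + log(sum_j e^{a_j})."""
    a = logits - np.max(logits)            # stability shift (value unchanged)
    return -a[y] + np.log(np.sum(np.exp(a)))

def binary_cross_entropy(f, y):
    """Bernoulli loss of Eq. (7) for a prediction f in (0, 1)."""
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

print(cross_entropy(np.array([2.0, 0.5, -1.0]), y=0))
print(binary_cross_entropy(0.9, 1))
```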
2.2.2.9. Regularization term. For regularization, the L2 penalty is commonly used; it contributes to the convexity of the optimization objective, helping convergence to the minimum of the solution via the Hessian matrix [66,84]. The L2 regularization term can be defined as:

Ω(θ) = (1/2) ||ω||^2   (10)

where ω represents the weights connecting the network units.
2.2.3. Performance evaluation parameters
Breast cancer and normal subjects are classified using ML classifiers, and performance is
measured by computing sensitivity, specificity, PPV, NPV, and Total Accuracy.
2.2.3.1. Sensitivity. Sensitivity measures the proportion of people who test positive for the disease among those who have the disease. Mathematically, it is expressed as:

Sensitivity = TP / (TP + FN)   (11)

2.2.3.2. Specificity. Specificity measures the proportion of negatives that are correctly identified. Mathematically, it is expressed as:

Specificity = TN / (TN + FP)   (12)

2.2.3.3. Positive predictive value (PPV). It is mathematically expressed as:

PPV = TP / (TP + FP)   (13)

2.2.3.4. Negative predictive value (NPV). It is mathematically expressed as:

NPV = TN / (TN + FN)   (14)

2.2.3.5. Total accuracy (TA). The total accuracy is computed as:

TA = (TP + TN) / (TP + FP + FN + TN)   (15)
2.2.4. Training/testing data formulation
The jack-knife k-fold cross-validation (CV) technique was applied for training/testing data formulation and parameter optimization. In this research, 2-, 4-, 5-, and 10-fold CV were used to evaluate the performance of the classifiers for the different feature-extracting strategies. The highest performance was obtained using 10-fold CV, in which the data are divided into 10 folds: 9 folds participate in training, and the classes of the samples in the remaining fold are predicted based on the training performed on the 9 folds. For the trained models, the test samples in the test fold are entirely unseen. The process is repeated 10 times so that every sample is predicted once; a similar approach is applied for the other CVs. Finally, the predicted labels of the unseen samples are used to determine the classification accuracy.
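A sketch of this protocol using scikit-learn's StratifiedKFold (the SVM-RBF classifier and the 3048-dimensional fused features are illustrative stand-ins):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(100, 3048)          # e.g. fused deep features
y = np.random.randint(0, 2, 100)       # cancer (1) vs. normal (0)

predictions = np.empty_like(y)
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    # Test-fold samples are unseen by the trained model.
    predictions[test_idx] = clf.predict(X[test_idx])

accuracy = np.mean(predictions == y)   # accuracy over all unseen predictions
print(accuracy)
```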
2.2.5. Receiver operating characteristic (ROC) curve
The ROC curve plots the true positive rate (TPR), i.e. sensitivity, against the false positive rate (FPR), i.e. 1 - specificity. The mean feature values for one class of subjects are labeled 1, and for the other class 0; this vector is then passed to the ROC function, which evaluates each sample value against the sensitivity and specificity values. The ROC curve is one of the standard ways to diagnose and visualize the performance of a classifier [85]. The TPR is plotted on the y-axis and the FPR on the x-axis. The area under the curve (AUC) gives the covered portion of a unit square, so its value lies between 0 and 1; an AUC > 0.5 indicates separation between the classes, and the higher the AUC, the better the diagnostic system. TPR is the number of correctly predicted positive cases divided by the total number of positive cases, while FPR is the number of negative cases predicted as positive divided by the total number of negative cases.
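A minimal example of computing the ROC curve and AUC with scikit-learn, using made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])              # 1 = cancer, 0 = normal
scores = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.3])  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)   # FPR on x, TPR on y
auc = roc_auc_score(y_true, scores)                # area under the curve
print(f"AUC = {auc:.3f}")                          # AUC > 0.5: separation
```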
3. Results
In this research, we employed DL CNN models using a TL approach to detect breast cancer. We also extracted multimodal features such as texture, morphological, SIFT, EFDs, and entropy features from the mammograms and applied ML classifiers such as the Bayesian approach, Support Vector Machine (SVM) kernels (polynomial, RBF, Gaussian), and Decision Trees. Using the TL approach, we trained the GoogleNet and AlexNet pre-trained models with 500 breast cancer and 399 normal mammograms; the features were then extracted using the softmax layer. The performance was evaluated in terms of sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), total accuracy (TA), false positive rate (FPR), and area under the receiver operating characteristic curve (AUC), as reflected in Table 2 and Figures 3–6. For the ML methods, four stages, namely pre-processing, feature extraction, training/test data formulation, and classification of the images into normal and cancer/malignant using SVM, Decision Tree, and Bayesian classifiers, were employed as detailed in [50]. The texture, morphological, entropy, SIFT, and EFDs features were extracted as discussed in [31,86,87]. In the DL TL approaches, we resized the images according to the network requirements and then trained the GoogleNet and AlexNet pre-trained models with the new set of cancer images.
Figure 3. Transfer learning-based proposed framework for detection of masses and microcalcification
using mammographic images.
Figure 4. Performance evaluation using ML and DL methods.
Using ML classifiers with Naïve Bayes, the highest performance in terms of total accuracy (TA) was obtained with the SIFT features (TA = 57.54%), followed by entropy (TA = 56.06%) and texture, morphological, and EFDs features (TA = 55.84%). The other performance metrics for the Bayes classifier are reflected in Table 2. Using the SVM polynomial classifier, the highest performance was obtained with texture features (TA = 82.65%), followed by morphological and entropy (TA = 82.42%), EFDs (TA = 77.42%), and SIFT (TA = 67.49%). SVM RBF gave the highest performance with entropy (TA = 85.21%), followed by morphological (TA = 84.20%), texture (TA = 83.98%), SIFT (TA = 73.68%), and EFDs (TA = 72.75%). Moreover, using SVM Gaussian, the highest performance was obtained with entropy (TA = 84.87%), followed by morphological (TA = 83.43%), texture (TA = 83.31%), SIFT (TA = 74.39%), and EFDs (TA = 73.75%). The Decision Tree classifier gave the highest performance with entropy (TA = 85.65%), followed by morphological (TA = 84.87%), SIFT (TA = 74.04%), texture (TA = 55.17%), and EFDs (TA = 47.16%). Using the DL CNN models, the highest performance was obtained with GoogleNet with default parameters and AlexNet with optimized parameters (TA = 99.26%), followed by AlexNet with default parameters (TA = 98.89%) and GoogleNet with optimized parameters (TA = 98.15%). The other performance metrics in terms of sensitivity, specificity, PPV, NPV, FPR, and AUC are reflected in Table 2.

Figure 5. Performance measure in the form of AUC using ML methods with (a) entropy features and (b) texture features, and DL methods (c) AlexNet and (d) GoogleNet.
Figure 4 depicts the evaluation performance of the ML classifiers and CNN methods for detecting breast cancer. For ML, different features were extracted, such as texture, morphology, entropy, SIFT, and EFDs; the best-performing feature set for each classifier is compared with the CNN methods. Using the Bayes classifier, the SIFT features performed best, with sensitivity (57.54%), specificity (43.81%), PPV (75.68%), NPV (81.62%), TA (57.54%), FPR (0.5619), and AUC (0.5088). Using the SVM polynomial kernel, the texture features gave the highest performance, with sensitivity (82.65%), specificity (82.46%), TA (82.65%), and AUC (0.5045). SVM RBF gave the highest performance using entropy features, obtaining sensitivity (85.21%), specificity (83.95%), TA (85.21%), and AUC (0.8857). Likewise, SVM Gaussian with entropy features gave the highest performance, with sensitivity (84.87%), specificity (83.07%), TA (84.87%), and AUC (0.8779). The Decision Tree classifier gave the highest performance using entropy features, with sensitivity (85.65%), specificity (84.75%), TA (85.65%), and AUC (0.9173). The performance of the CNN methods was evaluated using GoogleNet and AlexNet with default and optimized parameters. DL GoogleNet with default parameters gave sensitivity (99.26%), specificity (99.24%), PPV (99.26%), TA (99.26%), FPR (0.0076), and AUC (0.9998). GoogleNet with optimized parameters gave sensitivity (98.15%), specificity (98.19%), PPV (98.15%), NPV (98.03%), TA (98.15%), FPR (0.0181), and AUC (0.9983). Similarly, the DL CNN AlexNet method with default (auto) parameters gave sensitivity (98.89%), specificity (98.94%), PPV (98.89%), NPV (98.78%), TA (98.89%), FPR (0.0106), and AUC (0.9981). Moreover, AlexNet with optimized parameters gave sensitivity (99.26%), specificity (99.07%), PPV (99.27%), NPV (99.42%), TA (99.26%), FPR (0.0093), and AUC (0.9996).

Figure 6. Performance evaluation using GoogleNet with initial parameters and 378 iterations.
Figure 5 depicts the performance evaluation in terms of AUC for separating breast cancer subjects from normal subjects using the best-performing ML classifiers with different feature sets and the CNN methods. Using entropy features, the highest separation was obtained using the Decision Tree (AUC = 0.9173), followed by SVM RBF (AUC = 0.8857), SVM Gaussian (AUC = 0.8779), and Naïve Bayes and SVM polynomial (AUC = 0.507). Similarly, with texture features, the highest separation was obtained using SVM RBF (AUC = 0.8968), followed by SVM Gaussian (AUC = 0.8918), Decision Tree (AUC = 0.6878), and Naïve Bayes and SVM polynomial (AUC = 0.5045), as reflected in Figure 5(a–b). In terms of AUC, DL GoogleNet with default parameters obtained AUC = 0.9998 and AlexNet with an optimized set of parameters AUC = 0.9996, as reflected in Table 2.
Table 2. Performance evaluation based on Different extracted features using ML Classifiers and TL
Approaches using DL Methods.
Features Sensitivity Specificity PPV NPV TA FPR AUC
Bayes
Texture 0.5584 0.4466 0.7538 0.8036 0.5584 0.5534 0.5045
Morphological 0.5584 0.4466 0.7538 0.8036 0.5584 0.5534 0.5045
SIFT 0.5754 0.4381 0.7568 0.8162 0.5754 0.5619 0.5088
EFDs 0.5584 0.4466 0.7538 0.8036 0.5584 0.5534 0.5045
Entropy 0.5606 0.4494 0.7545 0.8041 0.5606 0.5506 0.507
SVM polynomial
Texture 0.8265 0.8246 0.8271 0.821 0.8265 0.1754 0.5045
Morphological 0.8242 0.8213 0.8246 0.8191 0.8242 0.1787 0.5045
SIFT 0.6749 0.6547 0.6729 0.6624 0.6749 0.3453 0.5088
EFDs 0.7742 0.7712 0.775 0.7677 0.7742 0.2288 0.5045
Entropy 0.8242 0.8213 0.8246 0.8191 0.8242 0.1787 0.507
SVM RBF
Texture 0.8398 0.8317 0.8396 0.8383 0.8398 0.1683 0.8968
Morphological 0.8420 0.8375 0.842 0.8383 0.8420 0.1625 0.9069
SIFT 0.7368 0.7248 0.7364 0.7268 0.7368 0.2752 0.7948
EFDs 0.7275 0.6929 0.7343 0.7406 0.7275 0.3071 0.7940
Entropy 0.8521 0.8359 0.8546 0.8597 0.8521 0.1641 0.8857
SVM Gaussian
Texture 0.8331 0.8269 0.8329 0.8300 0.8331 0.1731 0.8918
Morphological 0.8343 0.8318 0.8346 0.8292 0.8343 0.1682 0.9109
SIFT 0.7439 0.7274 0.7427 0.7356 0.7439 0.2726 0.7990
EFDs 0.7375 0.7055 0.7433 0.7492 0.7375 0.2945 0.7945
Entropy 0.8487 0.8307 0.8522 0.8585 0.8487 0.1693 0.8779
Decision tree
Texture 0.5517 0.6013 0.6028 0.5814 0.5517 0.3987 0.6878
Morphological 0.8487 0.8443 0.8487 0.8451 0.8487 0.1557 0.9117
SIFT 0.7404 0.7235 0.7391 0.732 0.7404 0.2765 0.8039
EFDs 0.4716 0.5566 0.5412 0.5232 0.4716 0.4434 0.5175
Entropy 0.8565 0.8475 0.8565 0.8567 0.8565 0.1525 0.9173
DL
GoogleNet AutoP 0.9926 0.9924 0.9926 0.9924 0.9926 0.0076 0.9998
GoogleNet DiffP 0.9815 0.9819 0.9815 0.9803 0.9815 0.0181 0.9983
AlexNet AutoP 0.9889 0.9894 0.9889 0.9878 0.9889 0.0106 0.9981
AlexNet DiffP 0.9926 0.9907 0.9927 0.9942 0.9926 0.0093 0.9996
Legends: AutoP (Auto/default parameters), DiffP (Different/Optimized Parameters).
Figure 6 depicts the performance of GoogleNet with default parameters over 6 epochs and 378 iterations. For both training and validation, accuracy was lower in the 1st and 2nd epochs, with correspondingly higher loss. The accuracy became higher in later iterations and epochs as the loss decreased. After the 2nd epoch, accuracy held almost steady near 100%, with a loss below 0.3, as can be observed in Figure 6.
Figure 7(a–b) shows the loss and accuracy at different iterations obtained using GoogleNet. In the initial iterations, the mini-batch and validation loss values were high, and they decreased in later iterations. As shown in Figure 7(a), the mini-batch loss at selected iterations was 0.8184 at the 1st iteration, 0.2545 at the 10th, 0.2597 at the 20th, 0.2059 at the 45th, and 0.0422 at the 55th. Similarly, the validation loss at selected iterations was 0.7308 at the 1st iteration, 0.3007 at the 10th, 0.1162 at the 20th, 0.0685 at the 45th, and 0.0669 at the 55th. Moreover, the accuracy at selected iterations using GoogleNet is reflected in Figure 7(b). The validation accuracy was 40% at the 1st iteration, 90% at the 10th, 85% at the 20th, 90% at the 45th, and 100% at the 55th. Similarly, the mini-batch accuracy was 36.30% at the 1st iteration, 84.44% at the 10th, 96.60% at the 20th, and 98.15% at the 45th and 55th iterations.

Figure 7. Performance measure using GoogleNet: (a) loss, (b) accuracy.
4. Discussions
The CNN uses a convolution operation in the convolution layer, which serves as a detection filter for the presence of a particular feature or pattern in the original data. Instead of being assigned a priori, as in conventional image processing, the parameters of such filters are learned from training data and are specialized to solve the problem at hand. This means that the lower layers of a CNN detect features that are common to most image recognition tasks, such as edges and curves [67]. Convolutional Neural Networks (CNNs) have had the greatest impact within the field of health informatics. Their architecture can be described as an interleaved set of feed-forward layers implementing convolutional filters, followed by reduction, rectification, or pooling layers; each layer in the network produces a higher-level abstract feature [88]. In CNNs, the weights of the network are shared in such a way that the network performs convolution operations on images. This way, the model does not need to learn separate detectors for the same object occurring at different positions in an image, making the network equivariant with respect to translations of the input. It also drastically reduces the number of parameters that need to be learned (i.e. the number of weights no longer depends on the size of the input image).
In DL, the first CNN to win the ILSVRC, which also made CNNs very popular, was the AlexNet architecture [33]. This architecture comprises 5 convolutional layers, max-pooling layers, dropout layers, and three fully connected layers, and employs ReLU as the activation function. It obtained a top-5 error rate of 15.6%, i.e. the rate of failing to classify an image within the five most likely classes. AlexNet was improved the next year by its authors, who modified its parameters and achieved a top-5 error rate of 11.2% [59]. In 2014, VGGNet [89], even though it did not win the competition, showed that it was possible to reduce the number of parameters while increasing the depth of the network, achieving better performance than the architectures mentioned above with an error rate of 7.3%. This architecture is composed of more convolutional layers than AlexNet, 13 exactly, which are smaller in terms of filter dimensions, leading to a reduction in parameters while being able to learn higher-level features than previous CNNs.
Another essential architecture, the winner of ILSVRC 2014 with an error rate of 6.7%, is GoogleNet [59,89]. It changed the way CNN architectures are structured, which until then stacked single layers one upon another sequentially, by introducing the inception module. The architecture is modularized, and its main block is the inception module, which is composed of convolutional layers arranged in parallel. GoogleNet has 122 layers, but not all are sequential: as opposed to AlexNet, parts of the network are executed in parallel, mainly its inception modules. Each of its nine inception modules is a network within the network layer, leading to over 100 layers in total. GoogleNet was trained on 'a few high-end GPUs within a week' [14].
In the present study, we first extracted hand-crafted features and fed them to different traditional machine learning (ML) algorithms. For the ML techniques, different features such as texture, morphological, entropy-based, SIFT, and EFDs features were extracted from breast cancer mammograms. In the second phase, CNN methods utilizing a TL approach were employed, in which the GoogleNet and AlexNet pre-trained models were trained. Deep learning methods are more robust when the data volume is large. Moreover, deep learning models automate the feature engineering process, extracting high-level characteristics directly from the data; this capability decreases the effort and time needed to construct a feature extractor for each problem. GoogleNet was retrained on the new set of cancer images. The weights of the earlier layers in the network were frozen by setting their learning rate to zero; the parameters were not updated while the training layers were frozen, which helped to improve the network performance significantly and to avoid overfitting. For each model, we used default and optimized parameters for evaluating the performance. A deep learning model with a transfer learning approach effectively utilizes previously learned model knowledge to solve a new task with fine-tuning or minimal training; the deep transfer learning (DTL) approach is also helpful in addressing computational issues. Applying the traditional machine learning algorithms with hand-crafted features, Naïve Bayes yielded its highest accuracy (57.54%) with SIFT features, SVM polynomial yielded its highest accuracy (82.65%) with texture features, SVM RBF provided an accuracy of 85.21% with entropy features, SVM Gaussian with entropy features provided an accuracy of 84.87%, and the Decision Tree yielded an accuracy of 85.65% with entropy features. The deep learning models with the transfer learning approach improved the classification performance: GoogleNet with default parameters yielded accuracy (99.26%) and AUC (0.9998), and AlexNet with optimized parameters yielded accuracy (99.26%) and AUC (0.9996).
5. Conclusion
In this research, CNN models were employed, and the results were compared with ML classification techniques such as SVM kernels, the Bayesian approach, and Decision Trees to distinguish cancer mammograms from those of normal subjects. Mass detection is difficult due to low image contrast, and microcalcification detection due to the large variation in size and shape; multimodal features were therefore extracted to distinguish the cancer mammograms effectively. We extracted texture, morphological, entropy-based, SIFT, and EFDs features for training and validating the ML classifiers. A 10-fold cross-validation was used to train and test the image database. The performance was measured based on specificity, sensitivity, PPV, NPV, FPR, and AUC. CNN GoogleNet with default parameters and AlexNet with optimized parameters gave the highest performance (TA and sensitivity = 99.26%; AUC = 0.9998 and 0.9996, respectively), followed by the Decision Tree (TA = 85.65%, AUC = 0.9173) and SVM RBF (TA = 85.21%, AUC = 0.8857). Among the ML classifiers, the entropy-based features gave higher performance evaluation measures than the other features extracted from the breast cancer mammograms. The detection performance of the deep learning methods with the transfer learning approach improved upon the traditional machine learning algorithms owing to their dynamic feature engineering characteristics. Thus, the proposed approach is more robust for improving breast cancer detection from mammograms and improving healthcare systems.
5.1. Limitations and future directions
The present study was focused to apply machine learning methods with diverse hand-
crafted features based approaches and deep learning methods. Though researchers are still
working on multiple aspects of feature-extracting strategies to improve the classification
performance of deep learning algorithms. In this context, a lightweight deep learning architecture will be utilized, using a minimum number of layers for optimized MRI scans with empirically controlled unknown parameters that generate dynamic features. Similarly, attention mechanisms will be used, which focus on the important regions of an image by increasing the weight of those locations, thereby compensating for the loss of spatial information while enriching the feature information. Another future direction is to collect a primary dataset for better BC control, containing the clinical parameters and demographic profiles of the patients as well as pathological control response, survival, and progression. We will also utilize hybrid deep learning methods and parametric optimization using grid search, Bayesian optimization, and genetic algorithms, as sketched below, to further improve the classification performance.
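As one example of the parametric optimization mentioned above, a grid search over SVM hyper-parameters can be set up with scikit-learn as follows; the feature matrix, parameter ranges, and scoring choice are illustrative assumptions, not the study's actual configuration.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for a matrix of extracted mammogram features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate hyper-parameter values to search exhaustively.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.001],
    "kernel": ["rbf", "poly"],
}

# Evaluate every combination with stratified 10-fold cross-validation,
# keeping the combination with the best mean AUC.
search = GridSearchCV(
    SVC(probability=True),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)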
Acknowledgement
This study is supported via funding from Prince Sattam Bin Abdulaziz University, project number PSAU/2023/R/1444. The authors would also like to thank the Deanship of Scientific Research at Shaqra University for its support.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
[1] Forouzanfar MH, et al. Breast and cervical cancer in 187 countries between 1980 and 2010: a
systematic analysis. Lancet. 2011;378(9801):1461–1484.
[2] Jemal A, et al. Global cancer statistics. CA Cancer J Clin. 2011;61(2):69–90.
[3] Dheeba J, Singh NA, Selvi ST. Computer-aided detection of breast cancer on mammo-
grams: a swarm intelligence optimized wavelet neural network approach. J Biomed Inform.
2014;49:45–52.
[4] DeSantis CE, et al. Breast cancer statistics, 2015: convergence of incidence rates between black
and white women. CA Cancer J Clin. 2016;66(1):31–42.
[5] Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin. 2019;69(1):7–34.
[6] Wirén S, et al. Pooled cohort study on height and risk of cancer and cancer death. Cancer Causes
Contr. 2014;25(2):151–159.
[7] Walter RB, et al. Height as an explanatory factor for sex differences in human cancer. J Natl Cancer
Inst. 2013;105(12):860–868.
[8] Ardakani AA, Gharbali A, Mohammadi A. Classification of breast tumors using sonographic
texture analysis. J Ultrasound Med. 2015;34(2):225–231.
[9] Sprague BL, et al. Variation in mammographic breast density assessments among radiologists in
clinical practice: a multicenter observational study. Ann Intern Med. 2016;165(7):457–464.
[10] Freer PE. Mammographic breast density: impact on breast cancer risk and implications for
screening. Radiographics. 2015;35(2):302–315.
[11] Acharya UR, et al. Data mining framework for breast lesion classification in shear wave ultra-
sound: a hybrid feature paradigm. Biomed Signal Process Contr. 2017;33:400–410.
[12] Zhang L, et al. Identifying ultrasound and clinical features of breast cancer molecular subtypes
by ensemble decision. Sci Rep. 2015;5(1):1–14.
[13] Sathish D, et al. Medical imaging techniques and computer aided diagnostic approaches for the
detection of breast cancer with an emphasis on thermography-a review. Int J Med Eng Inform.
2016;8(3):275–299.
[14] Machida Y, et al. Single focus on breast magnetic resonance imaging: diagnosis based on kinetic
pattern and patient age. Acta Radiol. 2017;58(6):652–659.
[15] Kolb TM, Lichy J, Newhouse JH. Comparison of the performance of screening mammography,
physical examination, and breast US and evaluation of factors that influence them: an analysis
of 27,825 patient evaluations. Radiology. 2002;225(1):165–175.
[16] Cheng H-D, et al. Approaches for automated detection and classification of masses in mammo-
grams. Pattern Recognit. 2006;39(4):646–668.
[17] Skaane P, Engedal K. Analysis of sonographic features in the differentiation of fibroadenoma and
invasive ductal carcinoma. Am J Roentgenol. 1998;170(1):109–114.
[18] Doi K. Computer-aided diagnosis: potential usefulness in diagnostic radiology and telemedicine.
In Proceedings of the National Forum: Military Telemedicine On-Line Today Research, Practice,
and Opportunities. 1995. IEEE.
[19] Hussain L, et al. Spatial wavelet-based coherence and coupling in EEG signals with eye open and
closed during resting state. IEEE Access. 2018;6:37003–37022.
[20] Hussain L, et al. Arrhythmia detection by extracting hybrid features based on refined Fuzzy
entropy (FuzEn) approach and employing machine learning techniques. Waves Random Com-
plex Media. 2020;30(4):656–686.
[21] Hussain L, et al. Regression analysis for detecting epileptic seizure with different feature extract-
ing strategies. Biomed Eng Biomed Tech. 2019;64(6):619–642.
[22] Karahaliou AN, et al. Breast cancer diagnosis: analyzing texture of tissue surrounding microcal-
cifications. IEEE Trans Inf Technol Biomed. 2008;12(6):731–738.
[23] Kupinski MA, Giger ML. Automated seeded lesion segmentation on digital mammograms. IEEE
Trans Med Imaging. 1998;17(4):510–517.
[24] Sahiner B, et al. Improvement of mammographic mass characterization using spiculation mea-
sures and morphological features. Med Phys. 2001;28(7):1455–1465.
[25] Zhen L, Chan AK. An artificial intelligent algorithm for tumor detection in screening mammo-
gram. IEEE Trans Med Imaging. 2001;20(7):559–567.
[26] Caldwell CB, et al. Characterisation of mammographic parenchymal pattern by fractal dimension. Phys Med Biol. 1990;35(2):235.
[27] Li H, Liu KR, Lo S-C. Fractal modeling and segmentation for the enhancement of microcalcifica-
tions in digital mammograms. IEEE Trans Med Imaging. 1997;16(6):785–798.
[28] Chen D-R, et al. Classification of breast ultrasound images using fractal feature. Clin Imaging.
2005;29(4):235–245.
[29] Hussain L, et al. Applying Bayesian network approach to determine the association between
morphological features extracted from prostate cancer images. IEEE Access. 2018;7:1586–1601.
[30] Qureshi SA, et al. Intelligent ultra-light deep learning model for multi-class brain tumor detec-
tion. Appl Sci. 2022;12(8):3715.
[31] Hussain L, et al. Prostate cancer detection using machine learning techniques by employing
combination of features extracting strategies. Cancer Biomark. 2018;21(2):393–413.
[32] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–444.
[33] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural
networks. Commun ACM. 2017;60(6):84–90.
[34] Delphia AA, Kamarasan M, Sathiamoorthy S. Image processing for identification of breast cancer:
a literature survey. Asian J Electr Sci. 2018;7(2):28–37.
[35] Kupinski MA, et al. Ideal observer approximation using Bayesian classification neural networks.
IEEE Trans Med Imaging. 2001;20(9):886–899.
[36] Lyons L. Statistical problems in particle physics, astrophysics and cosmology: PHYSTAT05,
Oxford, UK, 12–15 September 2005. 2006: Imperial College Press.
[37] Specht DF. Probabilistic neural networks. Neural Netw. 1990;3(1):109–118.
[38] Hamad YA, Simonov K, Naeem MB. Breast cancer detection and classification using artificial neu-
ral networks. In 2018 1st Annual International Conference on Information and Sciences (AiCIS).
2018. IEEE.
[39] Zheng B, Qian W, Clarke LP. Digital mammography: mixed feature neural network with
spectral entropy decision for detection of microcalcifications. IEEE Trans Med Imaging.
1996;15(5):589–597.
[40] Nahid A-A, Kong Y. Involvement of machine learning for breast cancer image classification: a
survey. Comput Math Methods Med. 2017;2017:3781951–3781951.
[41] Bhandare A, et al. Applications of convolutional neural networks. Int J Comp Sci Inform Technol.
2016;7(5):2206–2215.
[42] Lo S-CB, et al. A multiple circular path convolution neural network system for detection of
mammographic masses. IEEE Trans Med Imaging. 2002;21(2):150–158.
[43] Sahiner B, et al. Classification of mass and normal breast tissue: a convolution neural network
classifier with spatial domain and texture images. IEEE Trans Med Imaging. 1996;15(5):598–610.
[44] Jiao Z, et al. A deep feature based framework for breast masses classification. Neurocomputing.
2016;197:221–231.
[45] Fonseca P, et al. Automatic breast density classification using a convolutional neural network
architecture search procedure. In Medical imaging 2015: computer-aided diagnosis. 2015. SPIE.
[46] Su H, et al. Region segmentation in histopathological breast cancer images using deep convolu-
tional neural network. In 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI).
2015. IEEE.
[47] Huynh BQ, Li H, Giger ML. Digital mammographic tumor classification using transfer learning
from deep convolutional neural networks. J Med Imaging. 2016;3(3):034501.
[48] Arevalo J, et al. Representation learning for mammography mass lesion classification with
convolutional neural networks. Comput Methods Programs Biomed. 2016;127:248–257.
[49] Rezaeilouyeh H, Mollahosseini A, Mahoor MH. Microscopic medical image classification frame-
work via deep learning and Shearlet transform. J Med Imaging. 2016;3(4):044501.
[50] Jaffar MA. Deep learning based computer aided diagnosis system for breast mammograms. Int
J Adv Comp Sci Appl. 2017;8:7.
[51] Jadoon MM, et al. Three-class mammogram classification based on descriptive CNN features.
BioMed Res Int. 2017;2017:3640901–3640901.
[52] Gastounioti A, et al. Using convolutional neural networks for enhanced capture of breast
parenchymal complexity patterns associated with breast cancer risk. Acad Radiol. 2018;25(8):
977–984.
[53] Wang H, et al. Breast mass classification via deeply integrating the contextual information from
multi-view data. Pattern Recognit. 2018;80:42–52.
[54] Zhu W, et al. Adversarial deep structured nets for mass segmentation from mammograms. In
2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). 2018. IEEE.
[55] Ribli D, et al. Detecting and classifying lesions in mammograms with deep learning. Sci Rep.
2018;8(1):1–7.
[56] Chiao J-Y, et al. Detection and classification the breast tumors using mask R-CNN on sonograms.
Medicine. 2019;98:19.
[57] Nahid A-A, Mehrabi MA, Kong Y. Histopathological breast cancer image classification by deep
neural network techniques guided by local clustering. BioMed Res Int. 2018;2018:2362108–
2362108.
[58] Szegedy C, et al. Rethinking the inception architecture for computer vision. In Proceedings of the
IEEE conference on computer vision and pattern recognition. 2016.
[59] Shin H-C, et al. Deep convolutional neural networks for computer-aided detection: CNN archi-
tectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging. 2016;35(5):
1285–1298.
[60] Chen H, et al. Standard plane localization in fetal ultrasound via domain transferred deep neural
networks. IEEE J Biomed Health Inform. 2015;19(5):1627–1636.
[61] Heath M, et al. Current status of the digital database for screening mammography. In: Karssemei-
jer N, Thijssen M, Hendriks J, et al., editors. Digital mammography. Dordrecht: Springer; 1998. p.
457–460.
[62] Lévy D, Jain A. Breast mass classification from mammograms using deep convolutional neural
networks. arXiv preprint arXiv:1612.00542, 2016.
[63] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal
covariate shift. In International conference on machine learning. 2015. PMLR.
[64] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the thirteenth international conference on artificial intelligence and statistics.
2010. JMLR Workshop and Conference Proceedings.
[65] Chen T, et al. Improving sentiment analysis via sentence type classification using BiLSTM-CRF
and CNN. Expert Syst Appl. 2017;72:221–230.
[66] Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2009;22(10):1345–1359.
[67] Yosinski J, et al. How transferable are features in deep neural networks? Adv Neural Inf Process
Syst. 2014;27:1792.
[68] Deng J, et al. Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on
computer vision and pattern recognition. 2009. IEEE.
[69] Zhu Z, et al. Extreme weather recognition using convolutional neural networks. In 2016 IEEE
International Symposium on Multimedia (ISM). 2016. IEEE.
[70] Elhoseiny M, Huang S, Elgammal A. Weather classification with deep convolutional neural
networks. In 2015 IEEE International Conference on Image Processing (ICIP). 2015. IEEE.
[71] Soekhoe D, Putten PVD, Plaat A. On the impact of data set size in transfer learning using deep
neural networks. In International symposium on intelligent data analysis. 2016. Springer.
[72] Chu B, et al. Best practices for fine-tuning visual classifiers to new domains. In European confer-
ence on computer vision. 2016. Springer.
[73] Kim KG. Book review: deep learning. Healthc Inform Res. 2016;22(4):351–354.
[74] Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011. JMLR Workshop and Conference Proceedings.
[75] Goodfellow I, Bengio Y, Courville A. Convolutional networks. In: Goodfellow I, Bengio Y, Courville
A, editors. Deep learning. Cambridge: MIT Press; 2016. p. 330–372.
[76] Singh RG, Kishore N. The impact of transformation function on the classification ability of
complex valued extreme learning machines. In 2013 International Conference on Control, Com-
puting, Communication and Materials (ICCCCM). 2013. IEEE.
[77] Bengio Y. Practical recommendations for gradient-based training of deep architectures. In: Mon-
tavon G, Orr GB, Müller KR, editors. Neural networks: tricks of the trade. Berlin, Heidelberg:
Springer; 2012. p. 437–478.
[78] Maas AL, Hannun AY, Ng AY. Rectifier nonlinearities improve neural network acoustic models. In
Proc. ICML. 2013. Atlanta, Georgia, USA.
[79] Tóth L. Phone recognition with deep sparse rectifier neural networks. In 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing. 2013. IEEE.
[80] Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. Haifa: ICML;
2010.
[81] Jarrett K, et al. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th
international conference on computer vision. 2009. IEEE.
[82] Lai M. Deep learning for medical image segmentation. arXiv preprint arXiv:1505.02000, 2015.
[83] Rosasco L, et al. Are loss functions all the same? Neural Comput. 2004;16(5):1063–1076.
[84] Boyd S, Vandenberghe L. Convex optimization. Cambridge: Cambridge University Press; 2004.
[85] Hajian-Tilaki K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test
evaluation. Caspian J Intern Med. 2013;4(2):627.
[86] Mishra S, Panda M. A histogram-based classification of image database using scale invariant
features. Int J Image Graphics Signal Proc. 2017;9(6):55.
[87] Hussain L. Detecting epileptic seizure with different feature extracting strategies using robust
machine learning classification techniques by applying advance parameter optimization
approach. Cogn Neurodyn. 2018;12(3):271–294.
[88] Ravì D, et al. Deep learning for health informatics. IEEE J Biomed Health Inform. 2016;21(1):4–21.
[89] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.

More Related Content

Similar to Hussain et al BC Deep Learning March 2023.pdf

A Review on Data Mining Techniques for Prediction of Breast Cancer Recurrence
A Review on Data Mining Techniques for Prediction of Breast Cancer RecurrenceA Review on Data Mining Techniques for Prediction of Breast Cancer Recurrence
A Review on Data Mining Techniques for Prediction of Breast Cancer RecurrenceDr. Amarjeet Singh
 
Machine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer DiagnosisMachine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer DiagnosisPramod Sharma
 
Breast cancer diagnosis via data mining performance analysis of seven differe...
Breast cancer diagnosis via data mining performance analysis of seven differe...Breast cancer diagnosis via data mining performance analysis of seven differe...
Breast cancer diagnosis via data mining performance analysis of seven differe...cseij
 
Image processing and machine learning techniques used in computer-aided dete...
Image processing and machine learning techniques  used in computer-aided dete...Image processing and machine learning techniques  used in computer-aided dete...
Image processing and machine learning techniques used in computer-aided dete...IJECEIAES
 
Applying Deep Learning to Transform Breast Cancer Diagnosis
Applying Deep Learning to Transform Breast Cancer DiagnosisApplying Deep Learning to Transform Breast Cancer Diagnosis
Applying Deep Learning to Transform Breast Cancer DiagnosisCognizant
 
Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...
Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...
Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...CSCJournals
 
A Comparative Study on the Methods Used for the Detection of Breast Cancer
A Comparative Study on the Methods Used for the Detection of Breast CancerA Comparative Study on the Methods Used for the Detection of Breast Cancer
A Comparative Study on the Methods Used for the Detection of Breast Cancerrahulmonikasharma
 
IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...
IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...
IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...IRJET Journal
 
Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...
Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...
Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...IRJET Journal
 
Role of Tomosynthesis in Assessing the Size of the Breast Lesion
Role of Tomosynthesis in Assessing the Size of the Breast LesionRole of Tomosynthesis in Assessing the Size of the Breast Lesion
Role of Tomosynthesis in Assessing the Size of the Breast LesionApollo Hospitals
 
A Progressive Review on Early Stage Breast Cancer Detection
A Progressive Review on Early Stage Breast Cancer DetectionA Progressive Review on Early Stage Breast Cancer Detection
A Progressive Review on Early Stage Breast Cancer DetectionIRJET Journal
 
Logistic Regression Model for Predicting the Malignancy of Breast Cancer
Logistic Regression Model for Predicting the Malignancy of Breast CancerLogistic Regression Model for Predicting the Malignancy of Breast Cancer
Logistic Regression Model for Predicting the Malignancy of Breast CancerIRJET Journal
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...mlaij
 
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...mlaij
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...mlaij
 
Comparative analysis on bayesian classification for breast cancer problem
Comparative analysis on bayesian classification for breast cancer problemComparative analysis on bayesian classification for breast cancer problem
Comparative analysis on bayesian classification for breast cancer problemjournalBEEI
 

Similar to Hussain et al BC Deep Learning March 2023.pdf (20)

A Review on Data Mining Techniques for Prediction of Breast Cancer Recurrence
A Review on Data Mining Techniques for Prediction of Breast Cancer RecurrenceA Review on Data Mining Techniques for Prediction of Breast Cancer Recurrence
A Review on Data Mining Techniques for Prediction of Breast Cancer Recurrence
 
Machine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer DiagnosisMachine Learning - Breast Cancer Diagnosis
Machine Learning - Breast Cancer Diagnosis
 
Breast cancer diagnosis via data mining performance analysis of seven differe...
Breast cancer diagnosis via data mining performance analysis of seven differe...Breast cancer diagnosis via data mining performance analysis of seven differe...
Breast cancer diagnosis via data mining performance analysis of seven differe...
 
Image processing and machine learning techniques used in computer-aided dete...
Image processing and machine learning techniques  used in computer-aided dete...Image processing and machine learning techniques  used in computer-aided dete...
Image processing and machine learning techniques used in computer-aided dete...
 
Applying Deep Learning to Transform Breast Cancer Diagnosis
Applying Deep Learning to Transform Breast Cancer DiagnosisApplying Deep Learning to Transform Breast Cancer Diagnosis
Applying Deep Learning to Transform Breast Cancer Diagnosis
 
Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...
Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...
Toward Integrated Clinical and Gene Expression Profiles for Breast Cancer Pro...
 
A Comparative Study on the Methods Used for the Detection of Breast Cancer
A Comparative Study on the Methods Used for the Detection of Breast CancerA Comparative Study on the Methods Used for the Detection of Breast Cancer
A Comparative Study on the Methods Used for the Detection of Breast Cancer
 
IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...
IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...
IRJET- Comparison of Breast Cancer Detection using Probabilistic Neural Netwo...
 
Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...
Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...
Performance Evaluation using Supervised Learning Algorithms for Breast Cancer...
 
Li2019
Li2019Li2019
Li2019
 
Role of Tomosynthesis in Assessing the Size of the Breast Lesion
Role of Tomosynthesis in Assessing the Size of the Breast LesionRole of Tomosynthesis in Assessing the Size of the Breast Lesion
Role of Tomosynthesis in Assessing the Size of the Breast Lesion
 
A Progressive Review on Early Stage Breast Cancer Detection
A Progressive Review on Early Stage Breast Cancer DetectionA Progressive Review on Early Stage Breast Cancer Detection
A Progressive Review on Early Stage Breast Cancer Detection
 
Logistic Regression Model for Predicting the Malignancy of Breast Cancer
Logistic Regression Model for Predicting the Malignancy of Breast CancerLogistic Regression Model for Predicting the Malignancy of Breast Cancer
Logistic Regression Model for Predicting the Malignancy of Breast Cancer
 
Review_1.pdf
Review_1.pdfReview_1.pdf
Review_1.pdf
 
journals public
journals publicjournals public
journals public
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
 
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
BREAST TUMOR DETECTION USING EFFICIENT MACHINE LEARNING AND DEEP LEARNING TEC...
 
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
Breast Tumor Detection Using Efficient Machine Learning and Deep Learning Tec...
 
Comparative analysis on bayesian classification for breast cancer problem
Comparative analysis on bayesian classification for breast cancer problemComparative analysis on bayesian classification for breast cancer problem
Comparative analysis on bayesian classification for breast cancer problem
 
1. 501099
1. 5010991. 501099
1. 501099
 

More from LallHussain

ch15BayesNet.ppt
ch15BayesNet.pptch15BayesNet.ppt
ch15BayesNet.pptLallHussain
 
Bioinformatics Assignment No 02 (1).pdf
Bioinformatics Assignment No 02 (1).pdfBioinformatics Assignment No 02 (1).pdf
Bioinformatics Assignment No 02 (1).pdfLallHussain
 
Hussain et al PCR Breast.pdf
Hussain et al PCR Breast.pdfHussain et al PCR Breast.pdf
Hussain et al PCR Breast.pdfLallHussain
 
BUSINESS-LETTER.ppt
BUSINESS-LETTER.pptBUSINESS-LETTER.ppt
BUSINESS-LETTER.pptLallHussain
 
Radiomic Features.pdf
Radiomic Features.pdfRadiomic Features.pdf
Radiomic Features.pdfLallHussain
 

More from LallHussain (6)

ch15BayesNet.ppt
ch15BayesNet.pptch15BayesNet.ppt
ch15BayesNet.ppt
 
Bioinformatics Assignment No 02 (1).pdf
Bioinformatics Assignment No 02 (1).pdfBioinformatics Assignment No 02 (1).pdf
Bioinformatics Assignment No 02 (1).pdf
 
Hussain et al PCR Breast.pdf
Hussain et al PCR Breast.pdfHussain et al PCR Breast.pdf
Hussain et al PCR Breast.pdf
 
BUSINESS-LETTER.ppt
BUSINESS-LETTER.pptBUSINESS-LETTER.ppt
BUSINESS-LETTER.ppt
 
Radiomic Features.pdf
Radiomic Features.pdfRadiomic Features.pdf
Radiomic Features.pdf
 
1-intro.ppt
1-intro.ppt1-intro.ppt
1-intro.ppt
 

Recently uploaded

Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...parulsinha
 
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...chandars293
 
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...GENUINE ESCORT AGENCY
 
Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...
Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...
Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...parulsinha
 
Top Rated Bangalore Call Girls Majestic ⟟ 9332606886 ⟟ Call Me For Genuine S...
Top Rated Bangalore Call Girls Majestic ⟟  9332606886 ⟟ Call Me For Genuine S...Top Rated Bangalore Call Girls Majestic ⟟  9332606886 ⟟ Call Me For Genuine S...
Top Rated Bangalore Call Girls Majestic ⟟ 9332606886 ⟟ Call Me For Genuine S...narwatsonia7
 
Trichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service Available
Trichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service AvailableTrichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service Available
Trichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service AvailableGENUINE ESCORT AGENCY
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Dipal Arora
 
Call Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service AvailableDipal Arora
 
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...Ishani Gupta
 
Call Girls Shimla Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Shimla Just Call 8617370543 Top Class Call Girl Service AvailableCall Girls Shimla Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Shimla Just Call 8617370543 Top Class Call Girl Service AvailableDipal Arora
 
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...Dipal Arora
 
Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...
Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...
Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...Sheetaleventcompany
 
Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...
Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...
Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...vidya singh
 
Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...
Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...
Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...adilkhan87451
 
Call Girls Kakinada Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Kakinada Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Kakinada Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Kakinada Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Russian Call Girls Service Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...
Russian Call Girls Service  Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...Russian Call Girls Service  Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...
Russian Call Girls Service Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...parulsinha
 
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...parulsinha
 
Call Girls Raipur Just Call 9630942363 Top Class Call Girl Service Available
Call Girls Raipur Just Call 9630942363 Top Class Call Girl Service AvailableCall Girls Raipur Just Call 9630942363 Top Class Call Girl Service Available
Call Girls Raipur Just Call 9630942363 Top Class Call Girl Service AvailableGENUINE ESCORT AGENCY
 
Call Girls Vadodara Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Vadodara Just Call 8617370543 Top Class Call Girl Service AvailableCall Girls Vadodara Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Vadodara Just Call 8617370543 Top Class Call Girl Service AvailableDipal Arora
 

Recently uploaded (20)

Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
Premium Call Girls In Jaipur {8445551418} ❤️VVIP SEEMA Call Girl in Jaipur Ra...
 
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 9332606886 𖠋 Will You Mis...
 
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 9630942363 Top Class Pondicherry Escort Servi...
 
Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...
Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...
Independent Call Girls In Jaipur { 8445551418 } ✔ ANIKA MEHTA ✔ Get High Prof...
 
🌹Attapur⬅️ Vip Call Girls Hyderabad 📱9352852248 Book Well Trand Call Girls In...
🌹Attapur⬅️ Vip Call Girls Hyderabad 📱9352852248 Book Well Trand Call Girls In...🌹Attapur⬅️ Vip Call Girls Hyderabad 📱9352852248 Book Well Trand Call Girls In...
🌹Attapur⬅️ Vip Call Girls Hyderabad 📱9352852248 Book Well Trand Call Girls In...
 
Top Rated Bangalore Call Girls Majestic ⟟ 9332606886 ⟟ Call Me For Genuine S...
Top Rated Bangalore Call Girls Majestic ⟟  9332606886 ⟟ Call Me For Genuine S...Top Rated Bangalore Call Girls Majestic ⟟  9332606886 ⟟ Call Me For Genuine S...
Top Rated Bangalore Call Girls Majestic ⟟ 9332606886 ⟟ Call Me For Genuine S...
 
Trichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service Available
Trichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service AvailableTrichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service Available
Trichy Call Girls Book Now 9630942363 Top Class Trichy Escort Service Available
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
 
Call Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Rishikesh Just Call 8250077686 Top Class Call Girl Service Available
 
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
Mumbai ] (Call Girls) in Mumbai 10k @ I'm VIP Independent Escorts Girls 98333...
 
Call Girls Shimla Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Shimla Just Call 8617370543 Top Class Call Girl Service AvailableCall Girls Shimla Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Shimla Just Call 8617370543 Top Class Call Girl Service Available
 
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
Best Rate (Patna ) Call Girls Patna ⟟ 8617370543 ⟟ High Class Call Girl In 5 ...
 
Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...
Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...
Call Girls Service Jaipur {9521753030} ❤️VVIP RIDDHI Call Girl in Jaipur Raja...
 
Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...
Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...
Manyata Tech Park ( Call Girls ) Bangalore ✔ 6297143586 ✔ Hot Model With Sexy...
 
Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...
Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...
Russian Call Girls Lucknow Just Call 👉👉7877925207 Top Class Call Girl Service...
 
Call Girls Kakinada Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Kakinada Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Kakinada Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Kakinada Just Call 9907093804 Top Class Call Girl Service Available
 
Russian Call Girls Service Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...
Russian Call Girls Service  Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...Russian Call Girls Service  Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...
Russian Call Girls Service Jaipur {8445551418} ❤️PALLAVI VIP Jaipur Call Gir...
 
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
(Low Rate RASHMI ) Rate Of Call Girls Jaipur ❣ 8445551418 ❣ Elite Models & Ce...
 
Call Girls Raipur Just Call 9630942363 Top Class Call Girl Service Available
Call Girls Raipur Just Call 9630942363 Top Class Call Girl Service AvailableCall Girls Raipur Just Call 9630942363 Top Class Call Girl Service Available
Call Girls Raipur Just Call 9630942363 Top Class Call Girl Service Available
 
Call Girls Vadodara Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Vadodara Just Call 8617370543 Top Class Call Girl Service AvailableCall Girls Vadodara Just Call 8617370543 Top Class Call Girl Service Available
Call Girls Vadodara Just Call 8617370543 Top Class Call Girl Service Available
 

Hussain et al BC Deep Learning March 2023.pdf

  • 1. Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=twrm20 Waves in Random and Complex Media ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/twrm20 Deep convolutional neural networks accurately predict breast cancer using mammograms Lal Hussain, Sara Ansari, Mamoona Shabir, Shahzad Ahmad Qureshi, Amjad Aldweesh, Abdulfattah Omar, Zahoor Iqbal & Syed Ahmed Chan Bukhari To cite this article: Lal Hussain, Sara Ansari, Mamoona Shabir, Shahzad Ahmad Qureshi, Amjad Aldweesh, Abdulfattah Omar, Zahoor Iqbal & Syed Ahmed Chan Bukhari (2023): Deep convolutional neural networks accurately predict breast cancer using mammograms, Waves in Random and Complex Media, DOI: 10.1080/17455030.2023.2189485 To link to this article: https://doi.org/10.1080/17455030.2023.2189485 Published online: 14 Mar 2023. Submit your article to this journal View related articles View Crossmark data
  • 2. WAVES IN RANDOM AND COMPLEX MEDIA https://doi.org/10.1080/17455030.2023.2189485 Deep convolutional neural networks accurately predict breast cancer using mammograms Lal Hussaina,b, Sara Ansaric, Mamoona Shabird, Shahzad Ahmad Qureshie, Amjad Aldweeshf, Abdulfattah Omarg, Zahoor Iqbalh and Syed Ahmed Chan Bukharii aDepartment of Computer Science & IT, Neelum Campus, The University of Azad Jammu and Kashmir, Muzaffarabad, Pakistan; bDepartment of Computer Science & IT, King Abdullah Campus, The University of Azad Jammu and Kashmir, Muzaffarabad, Pakistan; cThe Children’s Hospital, University of Child Sciences, Lahore, Pakistan; dServices Institute of Medical Sciences, Lahore, Pakistan; eDepartment of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad, Pakistan; fCollege of Computer Science and Information Technology, Shaqra University, Shaqra, Saudi Arabia; gDepartment of English, College of Science & Humanities, Prince Sattam Bin Abdulaziz University, Al Kharj, Saudi Arabia; hDepartment of Mathematics, Quaid-i-Azam University, Islamabad, Pakistan; iHealthcare Informatics, St. John’s University, Queens, NY, USA ABSTRACT Breast cancer in women is the most frequently diagnosed and major leading cause of cancer deaths. Due to the complex nature of micro- calcification and masses, radiologists fail to diagnose breast can- cer properly. In this research paper, we have employed a novel Deep Convolutional Neural Network (DCNN) model using a transfer learning strategy and compared the results with Machine Learning (ML) techniques such as Support vector machine (SVM) kernels and Decision Trees based on different features extracting strategies to distinguish cancer mammograms from normal subjects. In this study, we first extracted the hand-crafted features such as as texture, mor- phological, entropy-based, scale-invariant feature transform (SIFT), and elliptic Fourier descriptors (EFDs) and fed into machine learn- ing algorithm for classification. We then utilized the deep learning algorithms with transfer learning approach. The deep learning mod- els yielded the highest detection performance with default and optimized parameters i.e. GoogleNet yielded accuracy (99.26%), AUC (0.9998) with default parameters and AlexNet yielded accuracy (99.26%), AUC (0.9996) with optimized parameters. The results reveal that proposed approach is more robust for early detection of breast mammogramswhichcanbebestutilizedforimproveddiagnosisand prognosis. ARTICLE HISTORY Received 29 November 2021 Accepted 20 February 2023 KEYWORDS Breast cancer; deep learning (DL); convolutional neural network (CNN); GoogleNet; AlexNet; support vector machine (SVM); scale invariant feature transform (SIFT) 1. Introduction Breast cancer is among women most frequently diagnosed cancers. In developing coun- tries, breast cancer accounts for 23% of the total cancer cases, and 1.6 million new cases of breast cancer are estimated worldwide, affecting women [1–3]. Breast cancer accounts for nearly one in three cancers among US women excluding skin cancer and is the second CONTACT Lal Hussain lall_hussain2008@live.com; Amjad Aldweesh a.aldweesh@su.edu.sa © 2023 Informa UK Limited, trading as Taylor & Francis Group
  • 3. 2 L. HUSSAIN ET AL. leading cause of cancer death among women after lung cancer [4]. In 2016, about 29% of deaths were accounted in females due to breast cancer in the United States State. In 2016, it was estimated that 595,690 Americans would die from cancer, corresponding to 1600 deaths per day [5]. The most common causes of cancer deaths are lung and bronchus, prostate, and colorectal cancers in men, and for women, these include lung and bronchus, breast, and colorectal cancers. The invasive cancer lifetime probability of being diagnosed in men (42%) is higher than in women (38%). This may be reflected due to external dif- ferences in environmental exposure, endogenous hormones, and complex interaction between these influences. Cancer incidences and deaths in both men and women are asso- ciated with an adult height determined by genetics and childhood nutrition accounting for 1/3 of 6 differences in cancer risk [5,6]. The cancer risk for adults younger than 50 years is higher in women (5.4%) than for men (3.4%) because of the relatively high burden of breast, genital, and thyroid cancers in young women [7]. The early diagnosis and detection of breast cancer can decrease the death rate and provide means for prompt treatment. Breast cancer is diagnosed and detected using a com- bination of approaches, including imaging, physical examination, and biopsy [8]. One of the imaging techniques used to detect breast cancer is mammography, where X-rays are used to create images, known as mammograms, of the breast. Radiologists are trained to read mammograms to detect the signs of breast cancer. The effectiveness of the screen- ing process can rely on radiologists’ explanations [9]. Patients affected by palpable breast cancer may have a sonogram and mammogram examination with both normal and benign or nonspecific appearance [10]. The biopsy is used to confirm the symptoms of breast can- cer, but it is an invasive surgical operation causing a psychological and physical impact on patients. To avoid unnecessary biopsies, researchers have devised and investigated var- ious computer-aided diagnosis (CAD) systems [3,11] providing stable detection rates by identifying ultrasound & clinical features [12], using data mining classification techniques, medical imaging and computer-aided diagnostics [13], and breast magnetic resonance imaging (MRI) [14]. As far as mammography is concerned, the research evidence that radiologists may miss up to 30% of breast cancers depending on the density of the breasts [15]. The mammo- grams in breast cancer have been evaluated using two powerful indicators: masses and micro-calcifications. Mass detection is more challenging than micro-calcification, not due to the large variation in size and shape in which masses can appear in mammograms but also because masses often exhibit poor image contrast [16]. Radiologists read mam- mograms based on their experience, training, and subjective criteria. There may be a 65–75% inter-observer variation rate even by the trained experts [17]. Hence, computer- aided diagnosis (CAD) may help radiologists to interpret mammograms to detect and classify masses. The literature also reveals that about 65–90% of the biopsies of suspected cancers turned out to be benign. Thus, it is essentially to develop techniques that can dis- tinguish the malignant and benign lesions. The combination of computer-aided diagnosis (CAD), expert knowledge, and Machine Learning (ML) techniques would greatly improve detection accuracy. 
The detection accuracy without CAD was obtained below 80%, and with computer-aided diagnosis (CAD) above 90% [18]. CAD can automatically identify the area of abnormal contrast, calling the radiologist towards suspicious regions. Thus, mammograms with computer-aided diagnosis (CAD) will improve the detection of can- cer. The cancer masses and micro-calcifications in many cases are hidden in the intense
  • 4. WAVES IN RANDOM AND COMPLEX MEDIA 3 breast tissues, especially in younger women, that are complex to detect and diagnose cancer [3]. Features extraction is an important step to detect any pathologies from physiological and neurophysiological systems. Likewise, time–frequency representation methods were employed by [19] to determine the correlation and coupling between the brain waves dur- ing resting states. Hussain et al. [20] extracted multimodal features based on fuzzy entropy to detect arrhythmia, which outperformed the traditional features extracting approaches and hybrid features [21] by employing regression methods to detect and predict epilep- tic seizures. Moreover, to distinguish normal images from malignant subjects, researchers extracted different imaging-related features. Karahaliou et al. [22] used a probabilistic neu- ral network to diagnose breast cancer by extracting multi-scale texture properties of the tissue surrounding the micro-calcifications. In the past few decades, other approaches have also been used to detect and diagnose breast cancer, viz., a probabilistic algorithm and radial gradient index-based algorithm [23], Convolution Neural Network (CNN) classifier [24], and a mixed feature-based neural network [25], fractal geometry and analysis using digital mammograms [26–28], and a method for automated segmentation of individual micro-calcifications in a region of interest (ROI). Recently, Hussain et al. [29] computed the associations between the morphological features extracted from the prostate cancer images and found very stronger associations among the features. In the past, researchers employed different hand-crafted feature-extracting strategies such as texture, morphology, gray level co-occurrence matrix, histogram of oriented gra- dients, scale-invariant feature transform, or a hybrid of these features for a brain tumor, prostate cancer, and arrhythmia detection using ML and DL techniques [20,30,31]. The existing techniques have some limitations; the graph-based techniques are competitively expensive. The other computer-aided diagnosis (CAD) techniques based on texture fea- tures exploited general texture features for classification and fail to provide the background knowledge of morphological features. The machine learning methods based on differ- ent feature-extracting strategies have limitations as different researchers employ different feature-extracting methods. However, these classifiers are not fine-tuned for challenging contrast existing in features. With the advent of modern computational systems, ML-related Artificial Intelligence application and graphical processing units (GPU) embedded processors have achieved exponential growth by developing novel models and methodologies which is currently knownasDL[32].TheDL-basedConvolutionNeuralNetwork(CNN)modeladoptsthearchi- tecture of an artificial neural network that contains a much larger number of processing layers which is contrary to the shallower architecture. CNN’s drastically reduce the struc- tural elements (i.e. neurons) in comparison to traditional feedforward neural networks [32]. For image processing, different baseline architectures of CNNs have been developed and successfully applied to complicated image-processing tasks. 
The breast cancer diagnosis has accompanied classification and segmentation perfor- mance improvement due to the representation learning, a characteristic of DL, due to its auto-feature extraction proficiency as compared with the handpicked feature extraction requirement in ML [33]. The learning phase is characterized by the flow of information exhibiting the capability of self-leering [34]. In DL, the Bayesian framework determines uncertainty in the model output using a Bayesian neural network [35,36]. Donald F. Specht introduced a probabilistic neural network (PNN), using the Bayesian classification theory,
  • 5. 4 L. HUSSAIN ET AL. consisting of three layers, viz. Input, Radial Basis, and Competitive layers [37,38]. PNN has been used to categorize mammography images into normal, benign, and malignant classes. The discrete wavelet transforms been used to find the input feature vector as handpicked features. They used seventy-five mammograms in their study and claimed an accuracy of 90%. Zhang, Lin, et al. [39] introduced a three-stage neural network method to alleviate the false positive rate of microcalcification in mammographic images. The microcalcification was detected in the first stage, followed by the second stage, where the FP detection was reduced from the first stage output. Lastly, in the third stage, the Kalman filter-based back propagation neural network isolated the microcalcifications in the mammograms. The DL networks using CNN achieved outclass performance for the detection and clas- sification of masses and microcalcifications. In this context, Fukushima et al introduced a light-weight CNN, known as ‘Recognition’, for medical image analysis [40,41]. Lo et al. [42] introduced a CNN with multiple circular paths where information was first collected from the suspected regions of mammograms, followed by processing as features using CNN. Sahiner et al. [43] proposed a CNN for mammography where selected regions, extracted by either averaging or subsampling were input to the CNN. Jiao et al. [44] classified breast masses using a DL-based strategy where intensity-based features were combined with CNN-extracted features using mammograms. Fonseca et al. [45] used CNN with an SVM classifier for the classification of breast cancer. Su et al. [46] introduced a rapid CNN method for breast cancer categorization where the semantic segmentation was carried out to reduce redundant information at the cost of higher com- plexity of the CNN model. Huynh et al. [47] used CNN by transfer learning to classify masses and microcalcification. Arevalo et al. [48] introduced a method that did not use hand crafted features where CNN was used to learn the data representation in a supervised learning manner from biopsy images of 344 breast cancer patients. Rezaeilouyeh et al. [49] proposed a microscopic breast cancer classification model using CNN where the shearlet transform-based images were obtained as the feature vectors. Subsequently, the shearlet coefficients were input to the CNN for classification. Jaffar [50] proposed a method that was based on the enhancement as preprocessing of mammo- grams, followed by CNN for feature extraction. The features were used to train the SVM classifier. Jadoon et al. [51] introduced a dual deep neural networks-based classification model for classes, viz. benign, malignant and normal. These algorithms were convolutional neural network-discrete wavelet and convolutional neural network-curvelet transform. The features extracted from discrete wavelet and curvelet transform based coefficients were fused and fed to the CNN. The CNN was trained on softmax and SVM for classification. Gastounioti et al. [52] used an ensemble classifier for breast cancer categorization. The textural feature maps, obtained from lattice-based methods, were fed to the CNN for multi-class categorization. Wang et al. [53] proposed a hybrid approach for breast can- cer classification into benign and malignant classes. The cropping and clinical features are extracted using multi-view patches of mammograms. 
Finally, the CNN was trained using multiple features to focus on the regions related to semantic-based lesions. Zhu et al. [54] introduced a combination of a fully convolutional network to segment the masses within mammograms by using a conditional random field. The method estimated ROIs on empir- ical basis with prior information on positions that helped to improve the prediction of ROIs.
  • 6. WAVES IN RANDOM AND COMPLEX MEDIA 5 Ribli et al. [55] introduced Faster Regions with Convolutional Neural Networks (R-CNN) forbreastcancerclassificationasbenignandmalignantcases.InFasterR-CNN,theROIpool- ing method was used to extract the features that are fed to the VGG-16 model. The output of the method resulted as bounding boxes with a confidence score that decides the class of cancer. Chiao et al. [56] proposed an improved version of the region proposal network called Mask R-CNN that was used for the detection and segmentation of cancer regions in mammograms. The Mask R-CNN method used the ROI alignment technique. After the fea- ture extraction from the ROI Align method, CNN was used for detection and classification processes.Nahidetal.[57]usedLSTMfortheclassificationofmicrocalcificationsandmasses by transforming mammograms into 1D-vector format, followed by conversion into time- series data. A total of 7909 images were used from the BreakHis histopathological dataset which were evaluated on SVM and Softmax at the decision layer. In contrast, the DL convolution neural network models with TL approaches are fine- tuned to optimize the parameters by minimizing the error. In this study, we have tested the generalization of the breast cancer mammographic images through AlexNet [33], and GoogleNet [58] as pre-trained CNN models using a TL approach verified in literature [59,60] in the most widely used imaging datasets. The features and training data were desired to lie within the same feature space. Transfer learning has the capability to allow the users to extract pre-known expertise and apply it on the new domain by reducing overall computa- tional time with the images lying in the combined feature space of two known TL methods on a broader spectrum with marked discrimination in feature space. The widened solution space, using the feature fusion, has resulted in the outclass performance. 2. Methods 2.1. Datasets Datasets were taken from publicly available databases provided by the University of South Florida [61] available online at (http://marathon.csee.usf.edu/Mammography/Database. html). In DDSM images, suspicious regions of interest are marked by experienced radi- ologists, and BI-RADS information is also annotated for each abnormal region. In our experiment, we used mass instance images digitized by LUMYSIS. This dataset contains approximately 2500 studies. We used the latest volumes of the DDMS database, i.e. 12 nor- mal volumes and 15 cancer volumes, 15 containing a total of 899 images, including 500 cancer images having 105 cases and 399 normal subject images having 100 cases. 2.2. Convolutional neural network Due to the outclass performance, CNNs have been used for breast cancer classification [62]. An end-to-end CNN architecture was applied to classify the cancer images directly to. To obtain high performance, we require a careful combination of pre-processing, TL, and data augmentation. In this proposed work, the performance was evaluated using two net- work architectures of CNN, namely AlexNet [33] and GoogleNet [58]. For both networks, the same architecture was used only replacing the last fully connected (FC) layer to output two classes. From GoogleNet, two auxiliary classifiers were removed. We also used batch normalization to regularize the data flowing between neural network layers reducing the
Input images of size 224 × 224 × 3 were supplied to the network. The CNN consists of convolution blocks composed of 3 × 3 convolutions with Batch Norm, ReLU, and Max Pooling, with 32, 32, and 64 filters respectively, followed by three fully connected layers of size 128, 64, and 2. The final layer is a softmax layer for binary classification. In this study, we used default and optimized parameters: Xavier's weight initialization [64], the ReLU activation function, and Adam's update rule [62]. For the default settings we used a base learning rate of 10^-4 and mini-batch sizes of 20 and 64, while for the optimized settings we used a momentum of 0.9, an initial learning rate of 0.001, a learning-rate drop factor of 0.1, L2 regularization of 0.004, a batch size of 20, and 2 epochs.

Consider an output y (e.g. the object depicted in an image) modeled as y = f(x, θ). Since the model is not known in advance, our aim is to use a generic model, described by a set of parameters θ, that is specialized to the target task. This can be done with a supervised ML approach by presenting the model with a set of input-label pairs (x, y) and iteratively updating its parameters so that the obtained output approaches the associated labels. To quantify the difference between the label ŷ predicted by the model and the desired label y, a loss function L(y, ŷ) is employed. The main purpose of the learning process is to select the parameter values θ that minimize this function, using an optimization method from the family of gradient descent algorithms.
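For concreteness, the block structure described above (three 3 × 3 convolution blocks with 32/32/64 filters, each with batch normalization, ReLU, and max pooling, then FC layers of 128, 64, and 2 with a softmax output) could be written as a MATLAB layer array roughly as follows; this is an illustrative sketch, not the authors' exact definition:

```matlab
layers = [
    imageInputLayer([224 224 3])
    convolution2dLayer(3, 32, 'Padding', 'same')   % block 1: 32 filters
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 32, 'Padding', 'same')   % block 2: 32 filters
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 64, 'Padding', 'same')   % block 3: 64 filters
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(128)
    fullyConnectedLayer(64)
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer];   % MATLAB's required training wrapper for softmax
```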
2.2.1. Deep learning ResNet101

ResNet101, named after the 101 layers of its residual network, is a modified version of the ResNet50 architecture. The ResNet model was originally proposed by He et al. in 2016 [32]. ResNet is an abbreviation for residual networks, and it has been employed to solve numerous computer vision problems. It is one of the deepest convolutional neural network architectures used at large scale and has served a wide range of applications on the ImageNet dataset (object detection and recognition, and various classification tasks). Generally, the multiple layers of a CNN are interconnected in a specified manner and are trained to perform various tasks. The basic idea behind the ResNet architecture is the residual connection, across which gradients can pass so that the chain rule does not drive them to zero [32]. ResNet101 has 104 convolutional layers organized into 33 blocks. Twenty-nine of the 33 blocks use the output of previous blocks directly, which is known as a residual connection; these residual connections supply the first operand of the summation operator at the end of each block. The remaining four blocks receive the output of the previous block as input and pass it through a convolutional layer with a filter size of 1 × 1 and a stride of 1, followed by a normalization layer; the normalized output is then transferred to the summation operator at the output of that block. The depth of each block may vary according to its density [65]. The general architecture of ResNet101 is shown in Figure 1, and Figures 6 and 7 show the replaced layers of ResNet101 before and after fine-tuning.

Figure 1. ResNet101 overall architecture.

The hyper-parameter settings found empirically for ResNet101 are listed in Table 1. The hyper-parameters of the CNN models were adjusted heuristically to facilitate the convergence of the loss function during training. The Adam optimizer was chosen because of the parameter-specific, adaptive nature of its learning rates. The initial learning rate was chosen as 0.0001 for ResNet101: a large learning rate may prevent the loss function from converging and can cause overshoots, whereas an extremely small learning rate drastically increases the training time. Mini-batch sizes of 10 and 12 were set according to the training speed and computational requirements, since extremely large batch sizes adversely affect model quality.

Table 1. Empirically tuned set of parameters.

Model                     Parameter               Value
ResNet101 (TL Deep CNN)   Optimizer               Adam
                          Momentum                0.90
                          Initial learning rate   0.0001
                          L2 regularization       0.00004
                          Max epochs              10
                          Minibatch size          12
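The Table 1 settings map onto MATLAB's trainingOptions roughly as below. Note, as an implementation detail, that the 'adam' solver exposes its momentum-like term as GradientDecayFactor (the 'Momentum' option belongs to 'sgdm'), so this is a sketch under that assumption:

```matlab
% Training options mirroring Table 1 (illustrative sketch)
options = trainingOptions('adam', ...
    'GradientDecayFactor', 0.90, ...   % plays the role of the momentum term
    'InitialLearnRate',    1e-4, ...
    'L2Regularization',    4e-5, ...
    'MaxEpochs',           10, ...
    'MiniBatchSize',       12, ...
    'Shuffle',             'every-epoch', ...
    'Plots',               'training-progress');
% net = trainNetwork(augTrain, lgraph, options);  % augTrain: a hypothetical
%                                                 % augmentedImageDatastore
```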
2.2.2. GoogleNet

GoogleNet was retrained on the new set of cancer images. The weights of the earlier layers in the network were frozen by setting their learning rates to zero. While the training layers were frozen, their parameters were not updated because the gradients of these layers were not computed; this helped to improve the network performance significantly and also helps to avoid overfitting the new dataset. The first 110 layers of GoogleNet include the inception modules. Using freezeWeights(), the learning rates of the first 110 layers were set to zero, and the layers were then reconnected in their original order using the createLgraph() connection function. Figure 2 illustrates the schematic diagram of the GoogleNet model.
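freezeWeights() and the layer-graph reconnection helper referenced above ship with MathWorks' transfer-learning example rather than being core toolbox built-ins; assuming those helpers are on the path, the freezing step looks roughly like this:

```matlab
% Freeze the first 110 layers (the inception modules) of the modified graph
layers = lgraph.Layers;
connections = lgraph.Connections;
layers(1:110) = freezeWeights(layers(1:110));   % zero the learn-rate factors
% Reconnect the layers in their original order
lgraph = createLgraphUsingConnections(layers, connections);
```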
Figure 2. Schematic diagram of GoogleNet architecture.

2.2.2.1. Train network framework. The networks require input images of size 224 × 224 × 3 for GoogleNet and 227 × 227 × 3 for AlexNet, whereas the images in the dataset have varying sizes; we therefore used the imresize() function to resize the images to the required input size. The TL-based framework adopts ResNet-101 (2048 features) and GoogleNet (1000 features) using mammograms; after fusion, 3048 features were used for each image. The entire dataset was fed to the cross-validation (10-fold) stage, and the optimized model was used to determine the performance on the test instances in discriminating the healthy from the diseased subjects.
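A sketch of this resize-extract-fuse step is shown below, assuming MathWorks' pre-trained resnet101 and googlenet. The layer names 'pool5' (ResNet-101's 2048-dimensional global pooling output) and 'loss3-classifier' (GoogleNet's 1000-dimensional classifier) come from those models, and the variable names are illustrative:

```matlab
% Resize on the fly to 224x224x3 and extract deep features from both nets
augds = augmentedImageDatastore([224 224], imds);
netR  = resnet101;
netG  = googlenet;
featR = activations(netR, augds, 'pool5', 'OutputAs', 'rows');            % 2048-D
featG = activations(netG, augds, 'loss3-classifier', 'OutputAs', 'rows'); % 1000-D
fused = [featR, featG];   % 3048 features per image, fed to 10-fold CV
```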
2.2.2.2. Transfer learning (TL) approach. We applied the TL approach using the GoogleNet and AlexNet CNNs pre-trained on ImageNet, comprising inception, convolution, and fully connected layers. The fully connected layers require a fixed input image size, while the convolution layers can work with arbitrary input sizes. To avoid overfitting during training, the images are resized to 224 × 224 × 3 for GoogleNet and 227 × 227 × 3 for AlexNet. Moreover, for GoogleNet we modified the dimension of the last fully connected layer from 1000 to 2. The last fully connected layer was also completely re-initialized at random, while all other layers retained their weights from pre-training. The shallow layers capture general, low-level image features, while the deeper layers are high-level and task specific; thus the learning rate of the deeper layers should be larger than that of the shallow layers. The batch size was set to 20, with an initial learning rate of 10^-4 and a maximum of 6 epochs over 378 iterations.

Training an entire CNN from scratch can be cumbersome because a small dataset may cause overfitting. To tackle this kind of problem, a TL technique is employed. TL can solve a new problem using previously learned knowledge, extracting knowledge from source tasks and applying it to a target task via the concepts of task T and domain D.

Consider a domain D = {χ, P(X)} comprising a feature space χ and a marginal probability distribution P(X), where X = {x_1, x_2, ..., x_n} ∈ χ. For a domain D = {χ, P(X)}, a task T = {γ, f(·)} comprises a label space γ and an objective predictive function f(·) learned from the training data, which consists of pairs {x_i, y_i} with x_i ∈ χ and y_i ∈ γ; f(·) predicts the corresponding label f(x) of a new instance x. Given a source domain D_s with source task T_s and a target domain D_t with target task T_t, the TL approach aims to improve the learning of the target predictive function f_t(·) in D_t using the knowledge in D_s and T_s, where D_s ≠ D_t or T_s ≠ T_t [66].

Various approaches have been employed to apply TL to CNNs [67]. For a CNN previously trained on another task, say image classification on the ImageNet dataset [68], two approaches can be distinguished: (a) fine tuning, in which the network parameters are retrained by backpropagating the error through the whole network [69]; and (b) freezing layers, in which most of the transferred features remain unchanged during training on the new task. The first layer contains the most generic features, common to many problems, while subsequent layers progressively become more specific to the target dataset [70].

Applying the proper type of TL to a specific task requires several factors to be taken into consideration. The most important are the dataset size [71] and its similarity to the dataset used to train the original network [72], viz. ImageNet. When the new dataset is smaller than the original one, the freezing-layer approach is most feasible, because the low-level features remain relevant for the target dataset; a smaller dataset may lead to overfitting when fine-tuning is employed, which suggests reserving fine-tuning for cases where more data are available. The fine-tuning approach is also suitable when the available dataset differs from the original one.

2.2.2.3. Convolutional layer. The convolutional layer is the main building block of a CNN. In a basic CNN, the convolution filter is a generalized linear model (GLM) for the underlying local image patch; it works well at low levels of abstraction, where the instances of the latent concepts are linearly separable. The layer has learnable filters: 3D arrays of numerical values that are spatially smaller than the input. The width and height are fixed by design choice, while the depth matches the number of input channels, i.e. the number of 2D inputs to the layer. During the forward pass, these filters slide across the height and width of the input; the sliding operation translates mathematically into a dot product between the filter and the input at each position. The 2D output is called an activation map, and it is stacked along the depth dimension with the other activation maps to form the output volume. The spatial size of the output is controlled by zero-padding techniques. For convolutional layer l, the output of the ith filter, denoted y_i^l, with C_{l-1} feature maps in the previous layer, is expressed as:

y_i^l = s\left( \sum_{j=1}^{C_{l-1}} f_{i,j}^l * y_j^{l-1} + b^l \right)    (1)

For layer l, the bias vector is denoted by b^l, the ith filter of the convolution layer is denoted by f_{i,j}^l, which connects to the jth feature map of layer l-1, and the activation function is represented by s. A convolution operation is also employed during the backward pass, but the filters are flipped spatially along both the height and width axes. Using the backpropagation algorithm, the parameters f_{i,j}^l are updated and learned by the network. In this way, the network is capable of learning various types of filters, with specialized properties, to solve many kinds of tasks.
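As a toy numerical illustration of the sliding dot product in Eq. (1), assuming a single input channel, zero bias, unit stride, no padding, and an identity activation:

```matlab
% A single 3x3 filter sliding over a 5x5 input ('valid' = no zero padding);
% conv2 flips the kernel, so pre-flipping it yields the cross-correlation
% that CNN convolutional layers actually compute
x = magic(5);                          % toy 5x5 input map
f = [1 0 -1; 1 0 -1; 1 0 -1];          % vertical-edge filter
y = conv2(x, rot90(f, 2), 'valid');    % 3x3 activation map (Eq. (1), s = identity)
```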
2.2.2.4. Pooling layer. The convolution layer is followed by the pooling layer, whose major function is to reduce the spatial size of the input and to operate independently on every depth slice. This layer is nonparametric and consists of filters that slide over the input with a fixed stride to produce the output [32,73]. The common filter functions are Max Pooling and Average Pooling.

2.2.2.5. Fully connected layer. To convert the combined features into class scores, at least one fully connected (FC) layer is present in a CNN before the network output. In this layer, each neuron is connected to all neurons of the preceding layer, following a mesh topology. The main function of this layer is to learn parameters (biases and weights) that map the input layer to the corresponding output layer. The output y^l of FC layer l is computed as:

y^l = s\left( y^{l-1} W^l + b^l \right)    (2)

where W^l and b^l denote the weights and bias vector of layer l, and s represents the activation function. FC layers, contrary to convolution layers, do not support parameter sharing; because of this, the number of learnable parameters of the CNN increases substantially.

2.2.2.6. Activation function. The nonlinearity that lets the network learn more complex functions is provided by the activation function. In the DL framework, the nonlinear transformation from input to output is performed by the activation functions of the nonlinear layers in combination with the other layers [74,75]. An appropriate activation function is therefore required for a better feature-extracting strategy [33,76,77]. A brief overview of the most commonly used activation functions g(·) follows.

The sigmoid function is given by g(a) = 1/(1 + e^{-a}), where a denotes the input from the preceding layer. The sigmoid transforms its input to values in the range 0 to 1 and is commonly used to produce a Bernoulli distribution:

\tilde{g} = \begin{cases} 0, & \text{if } g(a) \le 0.5 \\ 1, & \text{if } g(a) > 0.5 \end{cases}    (3)

The hyperbolic tangent function is given by g(a) = tanh(a) = (e^a - e^{-a})/(e^a + e^{-a}); its derivative, g' = 1 - g^2, makes it convenient to work with backpropagation algorithms.

The Softmax function is given by g(a)_i = e^{a_i} / \sum_j e^{a_j}. It is commonly used as the final output layer and can be considered a probability distribution over the categories.
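These activations are one-liners in MATLAB; in the sketch below, the max-subtraction inside the softmax is a standard numerical-stability trick rather than part of the definition above:

```matlab
sigmoid   = @(a) 1 ./ (1 + exp(-a));                       % squashes to (0, 1)
% tanh(a) is built in; its derivative is 1 - tanh(a).^2
softmaxFn = @(a) exp(a - max(a)) ./ sum(exp(a - max(a)));  % outputs sum to 1
p = softmaxFn([2.0 0.5 -1.0]);                             % p ≈ [0.786 0.175 0.039]
```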
The Rectified Linear Unit (ReLU) is the most widely used activation function, given by g(a) = max(0, a). Under gradient-based algorithms, ReLU confers the optimization-friendly properties of linear models; it is easy to implement and greatly accelerates the convergence of optimization methods [32,73]. Superior performance has been shown using this activation function and its variants, and it is so far the most popular activation function in DL [77-80]. The gradient diffusion problem can also be alleviated using the ReLU function [74,81,82].

The Softplus function, a variant of ReLU, is given by g(a) = log(1 + e^a); it is a smooth approximation of ReLU.

The absolute value rectification function, g(a) = |a|, is used when taking average values in the pooling layers of CNNs [81], since it prevents negative and positive features from cancelling out.

The Maxout function is given by g(x) = max_i(b_i + w_i · x). In this case, the weight matrix is a three-dimensional array whose third dimension corresponds to the connections between neighboring layers [75].

2.2.2.7. Optimization objective. The objective function is composed of a loss function and a regularization term. The loss function measures the discrepancy between the output of the network f(x|θ) and the expected result y, for example the true class labels in classification tasks and the true values in regression tasks. Regularization is the strategy of reducing the test error so that the learning algorithm performs well not only on the training data but also on unseen data [74,75]; to prevent overly complex models, regularization terms apply penalties to the parameters. Denoting the loss function by L(f(θ), y) and the regularization term by Ω(θ), the optimization objective is defined as:

\tilde{L}(X, y, \theta) = L(f(\theta), y) + \alpha \, \Omega(\theta)    (4)

where α balances the two components. Pragmatically, the loss function is usually computed across randomly sampled training examples rather than the data-generating distribution, because the latter is unknown.

2.2.2.8. Loss function. Most networks use the cross entropy between the model distribution and the training data as the loss function. The commonly used cross entropy is the negative conditional log-likelihood, L(f(θ), y) = -log P(y|x, θ), which represents the family of loss functions corresponding to the distribution of y given the value of the input variable x. Consider the following commonly used loss functions. Suppose y is continuous and has a Gaussian distribution given x. The loss function is:

L(f(\theta), y) = -\log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y - f)^2 \right) \right]    (5)

= \frac{1}{2\sigma^2} (y - f)^2 + \frac{1}{2} \log(2\pi\sigma^2)    (6)

This is equivalent to the squared error, which was the most commonly used loss function in the 1980s [74,75]; however, it penalizes outliers excessively, leading to slower convergence rates [83].
If the output variable y follows a Bernoulli distribution, the loss function is:

L(f(\theta), y) = -y \log f(\theta) - (1 - y) \log(1 - f(\theta))    (7)

Where y is discrete with k possible values, e.g. y ∈ {1, 2, 3, ..., k}, we can use the Softmax value as the probability over the categories, and the loss function becomes:

L(f(\theta), y) = -\log \frac{e^{a_y}}{\sum_j e^{a_j}}    (8)

= -a_y + \log \left( \sum_j e^{a_j} \right)    (9)

2.2.2.9. Regularization term. For regularization, the L2 penalty is commonly used; it contributes to the convexity of the optimization objective, driving convergence toward the minimum of the solution, as can be analyzed via the Hessian matrix [66,84]. The L2 regularization term is defined as:

\Omega(\theta) = \frac{1}{2} \|\omega\|^2    (10)

where ω denotes the weights connecting the units of the network.

2.2.3. Performance evaluation parameters

Breast cancer and normal subjects are classified using ML classifiers, and performance is measured by computing sensitivity, specificity, PPV, NPV, and total accuracy.

2.2.3.1. Sensitivity. Sensitivity measures the proportion of people who test positive for the disease among those who actually have the disease. Mathematically:

Sensitivity = TP / (TP + FN)    (11)

2.2.3.2. Specificity. Specificity measures the proportion of negatives that are correctly identified. Mathematically:

Specificity = TN / (TN + FP)    (12)

2.2.3.3. Positive predictive value (PPV). It is expressed mathematically as:

PPV = TP / (TP + FP)    (13)

2.2.3.4. Negative predictive value (NPV). It is expressed mathematically as:

NPV = TN / (TN + FN)    (14)

2.2.3.5. Total accuracy (TA). The total accuracy is computed as:

TA = (TP + TN) / (TP + FP + FN + TN)    (15)
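Given a 2 × 2 confusion matrix, Eqs. (11)-(15) reduce to a few lines; a sketch with illustrative variable names, assuming numeric labels with 1 = cancer (positive) and 0 = normal:

```matlab
% Tally TP, FN, FP, TN from true vs. predicted labels
cm = confusionmat(trueLabels, predLabels, 'Order', [1 0]);
TP = cm(1,1); FN = cm(1,2); FP = cm(2,1); TN = cm(2,2);
sens = TP / (TP + FN);                   % Eq. (11)
spec = TN / (TN + FP);                   % Eq. (12)
ppv  = TP / (TP + FP);                   % Eq. (13)
npv  = TN / (TN + FN);                   % Eq. (14)
ta   = (TP + TN) / (TP + FP + FN + TN);  % Eq. (15)
```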
2.2.4. Training/testing data formulation

The jack-knife k-fold cross-validation (CV) technique was applied for training/testing data formulation and parameter optimization. In this research, 2-, 4-, 5-, and 10-fold CVs were used to evaluate the performance of the classifiers for the different feature-extraction strategies. The highest performance was obtained using 10-fold CV, where the data are divided into 10 folds: nine folds participate in training, and the classes of the samples in the remaining fold are predicted from the model trained on those nine folds. The test samples in the test fold are entirely unseen by the trained models. The whole process is repeated 10 times, so that the class of each sample is predicted once; a similar approach is applied to the other CVs. Finally, the predicted labels of the unseen samples are used to determine the classification accuracy.

2.2.5. Receiver operating characteristic (ROC) curve

The ROC curve plots the true positive rate (TPR), i.e. sensitivity, against the false positive rate (FPR), i.e. 1 - specificity, for the cancer and normal subjects. The mean feature values of cancer subjects are labeled 1 and those of normal subjects are labeled 0; this vector is passed to the ROC function, which plots each sample value against the specificity and sensitivity values. The ROC curve is one of the standard ways to visualize and measure the performance of a classifier [85]. The TPR is plotted on the y-axis and the FPR on the x-axis. The area under the curve (AUC) represents a fraction of a unit square, so its value lies between 0 and 1; an AUC of 0.5 indicates chance-level separation, and a higher AUC indicates a better diagnostic system. The TPR is the number of correctly predicted positive cases divided by the total number of positive cases, while the FPR is the number of negative cases predicted as positive divided by the total number of negative cases.

3. Results

In this research, we employed DL CNN models using a TL approach to detect breast cancer. We also extracted multimodal features (texture, morphological, SIFT, EFDs, and entropy) from the mammograms and applied ML classifiers such as the Bayesian approach, Support Vector Machine (SVM) kernels (Polynomial, RBF, and Gaussian), and Decision Tree. Using the TL approach, we trained the GoogleNet and AlexNet pre-trained models with 500 breast cancer and 399 normal mammograms; the features were then extracted using the Softmax layer. The performance was evaluated in terms of sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), total accuracy (TA), false positive rate (FPR), and area under the receiver operating characteristic curve (AUC), as reflected in Table 2 and Figures 3-6. For the ML methods, the stages of pre-processing, feature extraction, training/test data formulation, and classification of images into normal and cancer/malignant classes using SVM, Decision Tree, and Bayesian classifiers were employed, as detailed in [50]. The texture, morphological, entropy, SIFT, and EFDs features were extracted as discussed in [31,86,87]. In the DL TL approaches, we resized the images according to the network requirements and then trained the GoogleNet and AlexNet pre-trained models on the new set of cancer images.
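A sketch of the 10-fold split and the ROC/AUC computation of Sections 2.2.4-2.2.5, assuming numeric labels (1 = cancer, 0 = normal), the fused feature matrix from Section 2.2.2.1, and an SVM as a stand-in classifier:

```matlab
cvp = cvpartition(labels, 'KFold', 10);      % stratified 10-fold partition
scores = zeros(numel(labels), 1);
for k = 1:10
    trIdx = training(cvp, k);  teIdx = test(cvp, k);
    mdl = fitcsvm(fused(trIdx,:), labels(trIdx));  % e.g. an SVM on fused features
    [~, s] = predict(mdl, fused(teIdx,:));
    scores(teIdx) = s(:, 2);                 % score column for the positive class
end
[fpr, tpr, ~, auc] = perfcurve(labels, scores, 1);
plot(fpr, tpr); xlabel('FPR (1 - specificity)'); ylabel('TPR (sensitivity)');
```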
Figure 3. Transfer learning-based proposed framework for detection of masses and microcalcifications using mammographic images.

Figure 4. Performance evaluation using ML and DL methods.

Using the ML classifiers, with Naïve Bayes the highest total accuracy (TA) was obtained with the SIFT features (TA = 57.54%), followed by entropy (TA = 56.06%) and texture, morphological, and EFDs (TA = 55.84%); the other performance metrics for the Bayes classifier are reflected in Table 2. Using the SVM polynomial classifier, the highest performance was obtained with the texture features (TA = 82.65%), followed by morphological and entropy (TA = 82.42%), EFDs (TA = 77.42%), and SIFT (TA = 67.49%).
Figure 5. Performance measure in the form of AUC using ML methods with (a) entropy features and (b) texture features, and DL methods (c) AlexNet and (d) GoogleNet.

The SVM RBF gives the highest performance with entropy (TA = 85.21%), followed by morphological (TA = 84.20%), texture (TA = 83.98%), SIFT (TA = 73.68%), and EFDs (TA = 72.75%). Moreover, using SVM Gaussian, the highest performance was obtained with entropy (TA = 84.87%), followed by morphological (TA = 83.43%), texture (TA = 83.31%), SIFT (TA = 74.39%), and EFDs (TA = 73.75%). The ML Decision Tree classifier gives the highest performance with entropy (TA = 85.65%), followed by morphological (TA = 84.87%), SIFT (TA = 74.04%), texture (TA = 55.17%), and EFDs (TA = 47.16%). Using the DL-CNN models, the highest performance was obtained using GoogleNet with default parameters and AlexNet with optimized parameters (TA = 99.26%), followed by AlexNet with default parameters (TA = 98.89%) and GoogleNet with optimized parameters (TA = 98.15%). The other performance metrics in terms of sensitivity, specificity, PPV, NPV, FPR, and AUC are reflected in Table 2.

Figure 4 depicts the evaluation performance of the ML classifiers and CNN methods in detecting breast cancer. For ML, different features were extracted (texture, morphological, entropy, SIFT, and EFDs), and the best-performing combinations are compared with the CNN methods. Using the Bayes classifier, the SIFT features performed best, with sensitivity (57.54%), specificity (43.81%), PPV (75.68%), NPV (81.62%), TA (57.54%), FPR (0.5619), and AUC (0.5088). Using the SVM polynomial kernel, the texture features give the highest performance, with sensitivity (82.65%), specificity (82.46%), TA (82.65%), and AUC (0.5045).
Figure 6. Performance evaluation using GoogleNet with initial parameters and 378 iterations.

SVM RBF gives the highest performance using the entropy features, obtaining sensitivity (85.21%), specificity (83.59%), TA (85.21%), and AUC (0.8857). Likewise, SVM Gaussian with entropy features gives the highest performance, with sensitivity (84.87%), specificity (83.07%), TA (84.87%), and AUC (0.8779). The Decision Tree classifier gives the highest performance using entropy features, with sensitivity (85.65%), specificity (84.75%), TA (85.65%), and AUC (0.9173). The performance of the CNN methods was evaluated using GoogleNet and AlexNet with default and optimized parameters. DL GoogleNet with default parameters gives sensitivity (99.26%), specificity (99.24%), PPV (99.26%), TA (99.26%), FPR (0.0076), and AUC (0.9998). GoogleNet with optimized parameters gives sensitivity (98.15%), specificity (98.19%), PPV (98.15%), NPV (98.03%), TA (98.15%), FPR (0.0181), and AUC (0.9983). Similarly, the DL CNN AlexNet method with default (auto) parameters gives sensitivity (98.89%), specificity (98.94%), PPV (98.89%), NPV (98.78%), TA (98.89%), FPR (0.0106), and AUC (0.9981). Moreover, AlexNet with optimized parameters gives sensitivity (99.26%), specificity (99.07%), PPV (99.27%), NPV (99.42%), TA (99.26%), FPR (0.0093), and AUC (0.9996).

Figure 5 depicts the performance, in terms of AUC, in separating breast cancer subjects from normal subjects using the best-performing ML classifier/feature combinations and the CNN methods. Using entropy features, the highest separation was obtained using the Decision Tree (AUC = 0.9173), followed by SVM RBF (AUC = 0.8857), SVM Gaussian (AUC = 0.8779), and Naïve Bayes and SVM polynomial (AUC = 0.507). Similarly, with texture features, the highest separation was obtained using SVM RBF (AUC = 0.8968), followed by SVM Gaussian (AUC = 0.8918), Decision Tree (AUC = 0.6878), and Naïve Bayes and SVM polynomial (AUC = 0.5045), as reflected in Figure 5(a-b). The performance in terms of AUC using DL GoogleNet with default parameters was (AUC = 0.9998) and AlexNet with an optimized set of parameters (AUC = 0.9996).
Table 2. Performance evaluation based on different extracted features using ML classifiers and TL approaches using DL methods.

Classifier / Features   Sensitivity  Specificity  PPV     NPV     TA      FPR     AUC
Bayes
  Texture               0.5584       0.4466       0.7538  0.8036  0.5584  0.5534  0.5045
  Morphological         0.5584       0.4466       0.7538  0.8036  0.5584  0.5534  0.5045
  SIFT                  0.5754       0.4381       0.7568  0.8162  0.5754  0.5619  0.5088
  EFDs                  0.5584       0.4466       0.7538  0.8036  0.5584  0.5534  0.5045
  Entropy               0.5606       0.4494       0.7545  0.8041  0.5606  0.5506  0.5070
SVM polynomial
  Texture               0.8265       0.8246       0.8271  0.8210  0.8265  0.1754  0.5045
  Morphological         0.8242       0.8213       0.8246  0.8191  0.8242  0.1787  0.5045
  SIFT                  0.6749       0.6547       0.6729  0.6624  0.6749  0.3453  0.5088
  EFDs                  0.7742       0.7712       0.7750  0.7677  0.7742  0.2288  0.5045
  Entropy               0.8242       0.8213       0.8246  0.8191  0.8242  0.1787  0.5070
SVM RBF
  Texture               0.8398       0.8317       0.8396  0.8383  0.8398  0.1683  0.8968
  Morphological         0.8420       0.8375       0.8420  0.8383  0.8420  0.1625  0.9069
  SIFT                  0.7368       0.7248       0.7364  0.7268  0.7368  0.2752  0.7948
  EFDs                  0.7275       0.6929       0.7343  0.7406  0.7275  0.3071  0.7940
  Entropy               0.8521       0.8359       0.8546  0.8597  0.8521  0.1641  0.8857
SVM Gaussian
  Texture               0.8331       0.8269       0.8329  0.8300  0.8331  0.1731  0.8918
  Morphological         0.8343       0.8318       0.8346  0.8292  0.8343  0.1682  0.9109
  SIFT                  0.7439       0.7274       0.7427  0.7356  0.7439  0.2726  0.7990
  EFDs                  0.7375       0.7055       0.7433  0.7492  0.7375  0.2945  0.7945
  Entropy               0.8487       0.8307       0.8522  0.8585  0.8487  0.1693  0.8779
Decision tree
  Texture               0.5517       0.6013       0.6028  0.5814  0.5517  0.3987  0.6878
  Morphological         0.8487       0.8443       0.8487  0.8451  0.8487  0.1557  0.9117
  SIFT                  0.7404       0.7235       0.7391  0.7320  0.7404  0.2765  0.8039
  EFDs                  0.4716       0.5566       0.5412  0.5232  0.4716  0.4434  0.5175
  Entropy               0.8565       0.8475       0.8565  0.8567  0.8565  0.1525  0.9173
DL
  GoogleNet AutoP       0.9926       0.9924       0.9926  0.9924  0.9926  0.0076  0.9998
  GoogleNet DiffP       0.9815       0.9819       0.9815  0.9803  0.9815  0.0181  0.9983
  AlexNet AutoP         0.9889       0.9894       0.9889  0.9878  0.9889  0.0106  0.9981
  AlexNet DiffP         0.9926       0.9907       0.9927  0.9942  0.9926  0.0093  0.9996

Legend: AutoP (auto/default parameters), DiffP (different/optimized parameters).

Figure 6 depicts the performance of GoogleNet with default parameters over 6 epochs and 378 iterations. For both training and validation, the accuracy was lower in the 1st and 2nd epochs, with correspondingly higher loss; the accuracy increases over later iterations and epochs as the loss decreases. After the 2nd epoch, the accuracy holds nearly steady close to 100%, with a loss below 0.3, as can be observed in Figure 6. Figure 6(a-b) shows the loss and accuracy over the iterations obtained using GoogleNet: in the initial iterations, the mini-batch and validation loss values were high and then decreased at later iterations. As shown in Figure 7(a), the mini-batch loss at selected iterations using GoogleNet was 0.8184 (1st iteration), 0.2545 (10th), 0.2597 (20th), 0.2059 (45th), and 0.0422 (55th).
Figure 7. Performance measure using GoogleNet: (a) loss, (b) accuracy.

Similarly, the validation loss at selected iterations was 0.7308 (1st iteration), 0.3007 (10th), 0.1162 (20th), 0.0685 (45th), and 0.0669 (55th). Moreover, the accuracy at selected iterations using GoogleNet is reflected in Figure 7(b): the validation accuracy was 40% (1st iteration), 90% (10th), 85% (20th), 90% (45th), and 100% (55th). Similarly, the mini-batch accuracy was 36.30% (1st iteration), 84.44% (10th), 96.60% (20th), and 98.15% (45th and 55th).
4. Discussion

A CNN applies a convolution operation in its convolutional layers, which serve as detection filters for the presence of particular features or patterns in the original data. Instead of being assigned a priori, as in conventional image processing, the parameters of such filters are learned from training data and become specialized to the problem at hand. Consequently, the lower layers of a CNN detect features that are common to most image recognition tasks, such as edges and curves [67]. Convolutional Neural Networks (CNNs) have had the greatest impact within the field of health informatics. Their architecture can be defined as an interleaved set of feed-forward layers implementing convolutional filters followed by reduction, rectification, or pooling layers, with each layer in the network originating a higher-level abstract feature [88]. The weights in a CNN are shared in such a way that the network performs convolution operations on images; the model therefore does not need to learn separate detectors for the same object occurring at different positions in an image, making the network equivariant with respect to translations of the input. This also drastically reduces the number of parameters to be learned (the number of weights no longer depends on the size of the input image).

In DL, the first CNN to win the ILSVRC, which also made CNNs very popular, was the AlexNet architecture [33]. This architecture comprises five convolutional layers, max-pooling layers, dropout layers, and three fully connected layers, and it employs ReLU as its activation function. It obtained a top-5 error rate of 15.6%, i.e. the rate at which the correct class is missing from the five most probable predictions. AlexNet was improved the following year by its authors, who modified its parameters and achieved a top-5 error rate of 11.2% [59]. In 2014, VGGNet [89], though it did not win the competition, showed that it is possible to reduce the number of parameters while increasing the depth of the network, achieving better performance than the architectures mentioned above, with an error rate of 7.3%. This architecture is composed of more convolutional layers than AlexNet (13 exactly), which are smaller in terms of filter dimensions, leading to fewer parameters while learning more high-level features than previous CNNs.

Another essential architecture, the winner of the ILSVRC 2014 with an error rate of 6.7%, is GoogleNet [59,89]. It changed the way CNN architectures are structured, which previously stacked single layers one upon another sequentially, by introducing the inception module. The architecture is modularized, and its main building block, the inception module, is composed of convolutional layers arranged in parallel. GoogleNet has 122 layers, not all sequential as in AlexNet: parts of the network execute in parallel, mainly within its inception modules. Each of its nine inception modules is a network within the network, leading to over 100 layers in total. GoogleNet was trained on 'a few high-end GPUs within a week' [14].

In the present study, we first extracted hand-crafted features and fed them to different traditional machine learning (ML) algorithms.
For the ML techniques, different features (texture, morphological, entropy-based, SIFT, and EFDs) were extracted from the breast cancer mammograms. In the second phase, CNN methods utilizing a TL approach were employed, in which the GoogleNet and AlexNet pre-trained models were fine-tuned.
Deep learning methods are more robust when the data volume is large. Moreover, deep learning models subsume the feature-engineering process, extracting high-level characteristics directly from the data rather than relying on domain knowledge; this capability decreases the effort and time needed to construct a feature extractor for each problem. GoogleNet was retrained on the new set of cancer images. The weights of the earlier layers in the network were frozen by setting the learning rate to zero; the parameters were thus not updated while the training layers were frozen, which helped to improve the network performance significantly and also helped to avoid overfitting. For each model, we used both default and optimized parameters when evaluating the performance. A deep learning model with a transfer learning approach effectively reuses previously learned knowledge to solve the new task with fine-tuning or minimal training; this deep transfer learning (DTL) approach also helps to address the computational issues.

Applying the traditional machine learning algorithms to the hand-crafted features, Naïve Bayes yielded its highest accuracy (57.54%) with the SIFT features, SVM polynomial its highest accuracy (82.65%) with the texture features, SVM RBF an accuracy of 85.21% with the entropy features, SVM Gaussian an accuracy of 84.87% with the entropy features, and the Decision Tree an accuracy of 85.65% with the entropy features. The deep learning models with the transfer learning approach improved the classification performance: GoogleNet with default parameters yielded accuracy (99.26%) and AUC (0.9998), and AlexNet with optimized parameters yielded accuracy (99.26%) and AUC (0.9996).

5. Conclusion

In this research, CNN models were employed and the results compared with ML classification techniques such as SVM kernels, the Bayesian approach, and Decision Tree to distinguish cancer mammograms from those of normal subjects. Mass detection is difficult due to low image contrast, and microcalcification detection due to the large variation in size and shape, so multimodal features were extracted to distinguish the cancer mammograms effectively. We extracted texture, morphological, entropy-based, SIFT, and EFDs features for training and validating the ML classifiers. A 10-fold cross-validation was used to train and test the image database. The performance was measured in terms of specificity, sensitivity, PPV, NPV, FPR, and AUC. The CNN GoogleNet with default parameters and AlexNet with optimized parameters give the highest performance (TA and sensitivity = 99.26%; AUC = 0.9998 and 0.9996, respectively), followed by the Decision Tree (TA = 85.65%, AUC = 0.9173) and SVM RBF (TA = 85.21%, AUC = 0.8857). Among the ML classifiers, the entropy-based features give the highest performance evaluation measures of all the features extracted from the breast cancer mammograms. The detection performance of the deep learning methods with the transfer learning approach improved upon the traditional machine learning algorithms owing to their dynamic feature-engineering characteristics. The proposed approach is thus more robust for improving the detection of breast cancer in mammograms and improving healthcare systems.

5.1. Limitations and future directions

The present study focused on applying machine learning methods with diverse hand-crafted feature-based approaches alongside deep learning methods.
Researchers are still working on multiple aspects of feature-extraction strategies to improve classification performance using deep learning algorithms.
In this context, a light deep learning architecture will be utilized, using a minimal number of layers for optimized MRI scans, with empirically controlled unknown parameters generating dynamic features. Similarly, attention mechanisms, which are increasingly used to focus on the important regions of an image by increasing the weight of informative locations, will be used to compensate for the loss of spatial information while improving the feature information. Another future direction is to collect a primary dataset for better BC control, containing the clinical parameters and demographic profiles of the patients as well as pathological control response, survival, and progression. We will also utilize hybrid deep learning methods and parametric optimization using grid search, Bayesian optimization, and genetic algorithms to further improve the classification performance.

Acknowledgement

This study is supported via funding from Prince Sattam Bin Abdulaziz University, project number PSAU/2023/R/1444. The authors would also like to thank the Deanship of Scientific Research at Shaqra University for its support.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

[1] Forouzanfar MH, et al. Breast and cervical cancer in 187 countries between 1980 and 2010: a systematic analysis. Lancet. 2011;378(9801):1461–1484.
[2] Jemal A, et al. Global cancer statistics. CA Cancer J Clin. 2011;61(2):69–90.
[3] Dheeba J, Singh NA, Selvi ST. Computer-aided detection of breast cancer on mammograms: a swarm intelligence optimized wavelet neural network approach. J Biomed Inform. 2014;49:45–52.
[4] DeSantis CE, et al. Breast cancer statistics, 2015: convergence of incidence rates between black and white women. CA Cancer J Clin. 2016;66(1):31–42.
[5] Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin. 2019;69(1):7–34.
[6] Wirén S, et al. Pooled cohort study on height and risk of cancer and cancer death. Cancer Causes Contr. 2014;25(2):151–159.
[7] Walter RB, et al. Height as an explanatory factor for sex differences in human cancer. J Natl Cancer Inst. 2013;105(12):860–868.
[8] Ardakani AA, Gharbali A, Mohammadi A. Classification of breast tumors using sonographic texture analysis. J Ultrasound Med. 2015;34(2):225–231.
[9] Sprague BL, et al. Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study. Ann Intern Med. 2016;165(7):457–464.
[10] Freer PE. Mammographic breast density: impact on breast cancer risk and implications for screening. Radiographics. 2015;35(2):302–315.
[11] Acharya UR, et al. Data mining framework for breast lesion classification in shear wave ultrasound: a hybrid feature paradigm. Biomed Signal Process Contr. 2017;33:400–410.
[12] Zhang L, et al. Identifying ultrasound and clinical features of breast cancer molecular subtypes by ensemble decision. Sci Rep. 2015;5(1):1–14.
[13] Sathish D, et al. Medical imaging techniques and computer aided diagnostic approaches for the detection of breast cancer with an emphasis on thermography: a review. Int J Med Eng Inform. 2016;8(3):275–299.
[14] Machida Y, et al. Single focus on breast magnetic resonance imaging: diagnosis based on kinetic pattern and patient age. Acta Radiol. 2017;58(6):652–659.
[15] Kolb TM, Lichy J, Newhouse JH. Comparison of the performance of screening mammography, physical examination, and breast US and evaluation of factors that influence them: an analysis of 27,825 patient evaluations. Radiology. 2002;225(1):165–175.
[16] Cheng H-D, et al. Approaches for automated detection and classification of masses in mammograms. Pattern Recognit. 2006;39(4):646–668.
[17] Skaane P, Engedal K. Analysis of sonographic features in the differentiation of fibroadenoma and invasive ductal carcinoma. Am J Roentgenol. 1998;170(1):109–114.
[18] Doi K. Computer-aided diagnosis: potential usefulness in diagnostic radiology and telemedicine. In Proceedings of the National Forum: Military Telemedicine On-Line Today Research, Practice, and Opportunities. 1995. IEEE.
[19] Hussain L, et al. Spatial wavelet-based coherence and coupling in EEG signals with eye open and closed during resting state. IEEE Access. 2018;6:37003–37022.
[20] Hussain L, et al. Arrhythmia detection by extracting hybrid features based on refined Fuzzy entropy (FuzEn) approach and employing machine learning techniques. Waves Random Complex Media. 2020;30(4):656–686.
[21] Hussain L, et al. Regression analysis for detecting epileptic seizure with different feature extracting strategies. Biomed Eng Biomed Tech. 2019;64(6):619–642.
[22] Karahaliou AN, et al. Breast cancer diagnosis: analyzing texture of tissue surrounding microcalcifications. IEEE Trans Inf Technol Biomed. 2008;12(6):731–738.
[23] Kupinski MA, Giger ML. Automated seeded lesion segmentation on digital mammograms. IEEE Trans Med Imaging. 1998;17(4):510–517.
[24] Sahiner B, et al. Improvement of mammographic mass characterization using spiculation measures and morphological features. Med Phys. 2001;28(7):1455–1465.
[25] Zhen L, Chan AK. An artificial intelligent algorithm for tumor detection in screening mammogram. IEEE Trans Med Imaging. 2001;20(7):559–567.
[26] Caldwell CB, et al. Characterisation of mammographic parenchymal pattern by fractal dimension. Phys Med Biol. 1990;35(2):235.
[27] Li H, Liu KR, Lo S-C. Fractal modeling and segmentation for the enhancement of microcalcifications in digital mammograms. IEEE Trans Med Imaging. 1997;16(6):785–798.
[28] Chen D-R, et al. Classification of breast ultrasound images using fractal feature. Clin Imaging. 2005;29(4):235–245.
[29] Hussain L, et al. Applying Bayesian network approach to determine the association between morphological features extracted from prostate cancer images. IEEE Access. 2018;7:1586–1601.
[30] Qureshi SA, et al. Intelligent ultra-light deep learning model for multi-class brain tumor detection. Appl Sci. 2022;12(8):3715.
[31] Hussain L, et al. Prostate cancer detection using machine learning techniques by employing combination of features extracting strategies. Cancer Biomark. 2018;21(2):393–413.
[32] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–444.
[33] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.
[34] Delphia AA, Kamarasan M, Sathiamoorthy S. Image processing for identification of breast cancer: a literature survey. Asian J Electr Sci. 2018;7(2):28–37.
[35] Kupinski MA, et al. Ideal observer approximation using Bayesian classification neural networks. IEEE Trans Med Imaging. 2001;20(9):886–899.
[36] Lyons L. Statistical problems in particle physics, astrophysics and cosmology: PHYSTAT05, Oxford, UK, 12–15 September 2005. Imperial College Press; 2006.
[37] Specht DF. Probabilistic neural networks. Neural Netw. 1990;3(1):109–118.
[38] Hamad YA, Simonov K, Naeem MB. Breast cancer detection and classification using artificial neural networks. In 2018 1st Annual International Conference on Information and Sciences (AiCIS). 2018. IEEE.
[39] Zheng B, Qian W, Clarke LP. Digital mammography: mixed feature neural network with spectral entropy decision for detection of microcalcifications. IEEE Trans Med Imaging. 1996;15(5):589–597.
[40] Nahid A-A, Kong Y. Involvement of machine learning for breast cancer image classification: a survey. Comput Math Methods Med. 2017;2017:3781951.
[41] Bhandare A, et al. Applications of convolutional neural networks. Int J Comp Sci Inform Technol. 2016;7(5):2206–2215.
[42] Lo S-CB, et al. A multiple circular path convolution neural network system for detection of mammographic masses. IEEE Trans Med Imaging. 2002;21(2):150–158.
[43] Sahiner B, et al. Classification of mass and normal breast tissue: a convolution neural network classifier with spatial domain and texture images. IEEE Trans Med Imaging. 1996;15(5):598–610.
[44] Jiao Z, et al. A deep feature based framework for breast masses classification. Neurocomputing. 2016;197:221–231.
[45] Fonseca P, et al. Automatic breast density classification using a convolutional neural network architecture search procedure. In Medical Imaging 2015: Computer-Aided Diagnosis. 2015. SPIE.
[46] Su H, et al. Region segmentation in histopathological breast cancer images using deep convolutional neural network. In 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI). 2015. IEEE.
[47] Huynh BQ, Li H, Giger ML. Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. J Med Imaging. 2016;3(3):034501.
[48] Arevalo J, et al. Representation learning for mammography mass lesion classification with convolutional neural networks. Comput Methods Programs Biomed. 2016;127:248–257.
[49] Rezaeilouyeh H, Mollahosseini A, Mahoor MH. Microscopic medical image classification framework via deep learning and Shearlet transform. J Med Imaging. 2016;3(4):044501.
[50] Jaffar MA. Deep learning based computer aided diagnosis system for breast mammograms. Int J Adv Comp Sci Appl. 2017;8:7.
[51] Jadoon MM, et al. Three-class mammogram classification based on descriptive CNN features. BioMed Res Int. 2017;2017:3640901.
[52] Gastounioti A, et al. Using convolutional neural networks for enhanced capture of breast parenchymal complexity patterns associated with breast cancer risk. Acad Radiol. 2018;25(8):977–984.
[53] Wang H, et al. Breast mass classification via deeply integrating the contextual information from multi-view data. Pattern Recognit. 2018;80:42–52.
[54] Zhu W, et al. Adversarial deep structured nets for mass segmentation from mammograms. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). 2018. IEEE.
[55] Ribli D, et al. Detecting and classifying lesions in mammograms with deep learning. Sci Rep. 2018;8(1):1–7.
[56] Chiao J-Y, et al. Detection and classification of the breast tumors using mask R-CNN on sonograms. Medicine. 2019;98:19.
[57] Nahid A-A, Mehrabi MA, Kong Y. Histopathological breast cancer image classification by deep neural network techniques guided by local clustering. BioMed Res Int. 2018;2018:2362108.
[58] Szegedy C, et al. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[59] Shin H-C, et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging. 2016;35(5):1285–1298.
[60] Chen H, et al. Standard plane localization in fetal ultrasound via domain transferred deep neural networks. IEEE J Biomed Health Inform. 2015;19(5):1627–1636.
[61] Heath M, et al. Current status of the digital database for screening mammography. In: Karssemeijer N, Thijssen M, Hendriks J, et al., editors. Digital mammography. Dordrecht: Springer; 1998. p. 457–460.
[62] Lévy D, Jain A. Breast mass classification from mammograms using deep convolutional neural networks. arXiv preprint arXiv:1612.00542, 2016.
[63] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. 2015. PMLR.
[64] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010. JMLR Workshop and Conference Proceedings.
[65] Chen T, et al. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Syst Appl. 2017;72:221–230.
[66] Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2009;22(10):1345–1359.
[67] Yosinski J, et al. How transferable are features in deep neural networks? Adv Neural Inf Process Syst. 2014;27:1792.
[68] Deng J, et al. Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009. IEEE.
[69] Zhu Z, et al. Extreme weather recognition using convolutional neural networks. In 2016 IEEE International Symposium on Multimedia (ISM). 2016. IEEE.
[70] Elhoseiny M, Huang S, Elgammal A. Weather classification with deep convolutional neural networks. In 2015 IEEE International Conference on Image Processing (ICIP). 2015. IEEE.
[71] Soekhoe D, Putten PVD, Plaat A. On the impact of data set size in transfer learning using deep neural networks. In International Symposium on Intelligent Data Analysis. 2016. Springer.
[72] Chu B, et al. Best practices for fine-tuning visual classifiers to new domains. In European Conference on Computer Vision. 2016. Springer.
[73] Kim KG. Book review: deep learning. Healthc Inform Res. 2016;22(4):351–354.
[74] Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011. JMLR Workshop and Conference Proceedings.
[75] Goodfellow I, Bengio Y, Courville A. Convolutional networks. In: Goodfellow I, Bengio Y, Courville A, editors. Deep learning. Cambridge: MIT Press; 2016. p. 330–372.
[76] Singh RG, Kishore N. The impact of transformation function on the classification ability of complex valued extreme learning machines. In 2013 International Conference on Control, Computing, Communication and Materials (ICCCCM). 2013. IEEE.
[77] Bengio Y. Practical recommendations for gradient-based training of deep architectures. In: Montavon G, Orr GB, Müller KR, editors. Neural networks: tricks of the trade. Berlin, Heidelberg: Springer; 2012. p. 437–478.
[78] Maas AL, Hannun AY, Ng AY. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML. 2013. Atlanta, Georgia, USA.
[79] Tóth L. Phone recognition with deep sparse rectifier neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 2013. IEEE.
[80] Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. Haifa: ICML; 2010.
[81] Jarrett K, et al. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision. 2009. IEEE.
[82] Lai M. Deep learning for medical image segmentation. arXiv preprint arXiv:1505.02000, 2015.
[83] Rosasco L, et al. Are loss functions all the same? Neural Comput. 2004;16(5):1063–1076.
[84] Boyd S, Boyd SP, Vandenberghe L. Convex optimization. Los Angeles: Cambridge University Press; 2004.
[85] Hajian-Tilaki K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian J Intern Med. 2013;4(2):627.
[86] Mishra S, Panda M. A histogram-based classification of image database using scale invariant features. Int J Image Graphics Signal Proc. 2017;9(6):55.
[87] Hussain L. Detecting epileptic seizure with different feature extracting strategies using robust machine learning classification techniques by applying advance parameter optimization approach. Cogn Neurodyn. 2018;12(3):271–294.
[88] Ravì D, et al. Deep learning for health informatics. IEEE J Biomed Health Inform. 2016;21(1):4–21.
[89] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.