Conference_paper.pdf

PREDICTION OF COVID-19 USING
MACHINE LEARNING TECHNIQUES
Narenraj Vivekanandan
Dept. of Electrical Engineering
National Institute of Technology
Calicut, India
naren.raj.vivek7@gmail.com
Mohamed Ashiq Rahman S
Calicut, India
ashiqxq@gmail.com
Vedant Mahalle
Calicut, India
vedantmahalle21@gmail.com
Sharath M Nair
Calicut, India
sharathmappu99@gmail.com
Abstract—The WHO and other medical authorities have proposed
numerous techniques for the diagnosis of COVID-19. The most
popular diagnostic method is reverse transcription polymerase
chain reaction (RT-PCR); other clinical diagnosis techniques
include antibody tests. Other research has focused on classifying
COVID versus non-COVID cases using chest X-ray images. However,
many of these classifiers operate directly on the images, which
increases the risk of overfitting. We propose a different model
that employs wavelet entropy to extract features from the chest
X-ray images and then classify them. The proposed technique
extracts space-frequency features from chest X-ray images using
the Discrete Wavelet Transform and reduces their dimensionality
using Shannon entropy. The resulting feature vector is used to
train standard machine learning classifiers, namely Logistic
Regression, Support Vector Machine, Decision Tree and Gaussian
Naïve Bayes, which are compared with a Convolutional Neural
Network applied directly to the images.
Keywords— wavelet transform, entropy, logistic regression,
naive bayes, decision tree, support vector machine
I. INTRODUCTION
Rapid and reliable diagnosis of COVID-19 is one of the
foremost challenges we face today, especially for patients who
may be critically ill and need medical care. SARS-CoV-2, the
virus that causes COVID-19, mainly affects the lungs of the
infected person; its most common effects are severe respiratory
illness and pneumonia. These effects can commonly be diagnosed
by examining chest X-ray (CXR) images. Previous studies have
shown that machine learning models can read X-ray images more
accurately than the human eye. Diagnosing COVID-19 with CXR
images is more reliable and rapid than the RT-PCR or antigen
tests. We have built a machine learning model that can help the
medical community diagnose COVID-19 quickly from CXR images
using a pretrained model. We use the Discrete Wavelet Transform
(DWT) for feature extraction, as studies have shown that wavelet
transforms are excellent at detecting edges and distinguishing
frequencies. We then train different classifiers on the
extracted features (Logistic Regression, Support Vector Machine,
Decision Tree and Naive Bayes) and study the results. We also
examine the effect of using a CNN to classify the images
without feature extraction.
II. BASIC CONCEPTS
A. Discrete Wavelet Transform
The discrete wavelet transform (DWT) is used to obtain a
multi-scale (frequency) representation of a function. Using
wavelets, image data can be analyzed at multiple resolutions.
The wavelet transform is good at capturing fine details because
it retains the high-frequency components. The 1D-DWT of a signal
x is calculated by passing it through a pair of high-pass and
low-pass filters (quadrature mirror filters) with impulse
responses h and g respectively.
Fig 1. Filter representation of Wavelet Transform.
The approximation coefficients are given by
$$Y_{low}[n] = \sum_{k=-\infty}^{\infty} x[k]\, g[2n-k] \qquad (2)$$
The detail coefficients are given by
$$Y_{high}[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[2n-k] \qquad (3)$$

At each decomposition level the filters halve the bandwidth of
the signal, so half of the samples can be discarded as well, in
accordance with the Nyquist criterion.
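The filtering and downsampling steps can be illustrated with a minimal Python sketch of one decomposition level. It is an example only, not the implementation used in this work: it assumes the Haar filter pair, an even-length signal, and non-overlapping sample pairs (a one-sample shift of the convolution index in equations (2) and (3)). The function name haar_dwt_1d is hypothetical.

```python
import math

def haar_dwt_1d(x):
    """One level of a Haar DWT: returns (approximation, detail) coefficients."""
    if len(x) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)  # Haar taps (assumed wavelet): low-pass [s, s], high-pass [s, -s]
    # Each output uses one pair of samples, so the output has half the input length.
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]  # low-pass branch
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]  # high-pass branch
    return approx, detail

signal = [4.0, 6.0, 10.0, 12.0]
approx, detail = haar_dwt_1d(signal)
print("approximation:", approx)
print("detail:", detail)
```

Each branch returns half as many coefficients as there are input samples, and because the Haar pair is orthonormal the total energy of the coefficients equals that of the signal.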
The one-dimensional discrete wavelet transform (1D-DWT) can
be extended to two dimensions (2D-DWT) by processing along the
x and y axes using low-pass filters (expanded wavelets) and
high-pass filters (shrunken wavelets). The level-1
decomposition generates four sub-band images (LL1, LH1, HL1,
HH1). The LL1 sub-band, which contains the low-frequency
components, can be regarded as the approximation component of
the image, while the LH1, HL1 and HH1 sub-bands, which contain
relatively higher-frequency portions of the image, hold its
detail components.
Fig 2. 2-level wavelet decomposition
Working on the assumption that most of the image information
is contained in the LL1 sub-band, it can be further decomposed
to level 2, giving 7 sub-bands in total (LL2, LH2, HL2, HH2,
LH1, HL1, HH1).
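The two-level decomposition into seven sub-bands can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the experiments: it assumes the Haar wavelet and image sides divisible by 4. The function names are hypothetical, and so is the naming convention, in which the first letter of a sub-band refers to the filter applied along rows and the second to the filter applied along columns.

```python
import math

def haar_pairs(v):
    """Haar low-pass and high-pass outputs for one even-length row or column."""
    s = 1.0 / math.sqrt(2.0)
    low = [(v[i] + v[i + 1]) * s for i in range(0, len(v), 2)]
    high = [(v[i] - v[i + 1]) * s for i in range(0, len(v), 2)]
    return low, high

def transpose(m):
    return [list(col) for col in zip(*m)]

def filter_rows(m):
    """Apply haar_pairs to every row; returns (low-pass matrix, high-pass matrix)."""
    pairs = [haar_pairs(row) for row in m]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def haar_dwt_2d(img):
    """One 2D level: filter along rows, then along the columns of each result."""
    low_r, high_r = filter_rows(img)
    ll, lh = (transpose(b) for b in filter_rows(transpose(low_r)))
    hl, hh = (transpose(b) for b in filter_rows(transpose(high_r)))
    return ll, lh, hl, hh

def two_level_subbands(img):
    """Two-level decomposition: only the LL1 band is decomposed again."""
    ll1, lh1, hl1, hh1 = haar_dwt_2d(img)
    ll2, lh2, hl2, hh2 = haar_dwt_2d(ll1)
    return {"LL2": ll2, "LH2": lh2, "HL2": hl2, "HH2": hh2,
            "LH1": lh1, "HL1": hl1, "HH1": hh1}

# Toy 8x8 "image" standing in for a chest X-ray.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
for name, band in two_level_subbands(image).items():
    print(name, len(band), "x", len(band[0]))
```

The level-1 detail bands have half the side length of the image and the level-2 bands a quarter, which is why the seven sub-bands together hold as many coefficients as the original image has pixels.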
B. Entropy
The major disadvantage of the discrete wavelet transform
is the curse of dimensionality: too many features result in
increased computation time and excessive storage requirements.
To overcome this disadvantage, we have to reduce the number of
coefficients, so we employ an additional parameter, entropy,
which reduces the dimension by summarizing the inter-related
variables while retaining sufficient information. In
information theory, entropy is the lower limit to which
information can be compressed without loss.
Shannon defined the entropy H of a discrete random
variable X with values {x1, x2, …, xn} and probability mass
function P(X) as:
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i) \qquad (4)$$
where b is the base of the logarithm.
Shannon's entropy thus quantifies the amount of
information available in a variable. It is defined as the
absolute minimum amount of storage required to capture that
information without loss.
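As a small worked illustration, Shannon's entropy can be computed directly from a probability mass function. The sketch below is an example only. The function name shannon_entropy and the default base of 2 (entropy in bits) are assumptions, and zero-probability outcomes are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math

def shannon_entropy(probs, base=2):
    """H = -sum_i p_i * log_b(p_i); outcomes with p_i = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))       # two equally likely outcomes
print(shannon_entropy([0.25] * 4))       # four equally likely outcomes
print(shannon_entropy([0.9, 0.1]))       # a skewed distribution carries less information
```

A uniform distribution maximizes the entropy, while a distribution concentrated on a single outcome has zero entropy.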
C. Feature Extraction
A 256×256 image can yield 65536 coefficients. With the
entropy parameter, however, the number of features is reduced
to 7 entropy values, one for each sub-band of the 2-level 2D
wavelet transform of the image, which is computationally far
more efficient.
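The reduction from sub-band coefficients to a 7-element feature vector might look like the following sketch. The text does not specify how the coefficients are turned into probabilities, so the normalisation used here (squared coefficients divided by the sub-band energy) is an assumption, as are the names band_entropy, wavelet_entropy_features and BAND_ORDER and the ordering of the sub-bands.

```python
import math

# Assumed ordering of the seven sub-bands of a 2-level decomposition.
BAND_ORDER = ("LL2", "LH2", "HL2", "HH2", "LH1", "HL1", "HH1")

def band_entropy(band, base=2):
    """Shannon entropy of the normalised energy distribution of one sub-band."""
    energies = [c * c for row in band for c in row]
    total = sum(energies)
    if total == 0:
        return 0.0  # an all-zero band carries no information
    return -sum((e / total) * math.log(e / total, base) for e in energies if e > 0)

def wavelet_entropy_features(subbands):
    """Map a dict of sub-band matrices to a 7-element entropy feature vector."""
    return [band_entropy(subbands[name]) for name in BAND_ORDER]

# Toy sub-bands standing in for the output of a 2-level 2D DWT.
toy = {name: [[1.0, 2.0], [3.0, 4.0]] for name in BAND_ORDER}
features = wavelet_entropy_features(toy)
print(len(features), features)
```

Whatever the image size, each image is described by seven numbers, which is what makes the subsequent classifiers cheap to train.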
III. MACHINE LEARNING MODELS
A. Naive Bayes
Naive Bayes is a family of probabilistic algorithms that use
probability theory and Bayes' theorem to predict an event.
They are probabilistic in the sense that, for a given sample,
they compute the likelihood of each label and then output the
label with the highest one. These probabilities are obtained
with Bayes' theorem, which expresses the probability of a label
given a feature in terms of prior knowledge of conditions that
may be related to that feature.
Abstractly, Naive Bayes is a conditional probability
model. We are given a problem sample X to be classified,
where
$$X = \{x_1, x_2, x_3, \ldots, x_n\} \qquad (5)$$
represents n features (independent variables).
The model estimates the probability of a dependent class C
with a small number of outcomes (here, Covid positive or
negative), conditional on the feature vector X:
$$p(C_k \mid x_1, x_2, x_3, \ldots, x_n) \qquad (6)$$
If a feature can take on a large number of values, or
the number of features n is large, basing such a model on
probability tables is impractical. Using Bayes' theorem,
the conditional probability can be rewritten as
$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)} \qquad (7)$$
Thus the posterior probability combines both sources of
information, the prior and the likelihood. Since the feature
values are given, the denominator is a constant that does not
depend on the class and is not considered in practice.
Now, assuming that the features are conditionally
independent given the class, i.e. each feature x_i is
independent of the others, the joint model can be expressed as
$$p(C_k \mid x_1, x_2, x_3, \ldots, x_n) \propto p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k) \qquad (8)$$
where p(x_i | C_k) can be estimated from the training
samples.
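Equation (8) can be illustrated with a minimal Gaussian Naive Bayes sketch. It is an example under stated assumptions, not the classifier used in the experiments: each p(x_i | C_k) is assumed to be a normal density whose mean and variance are estimated from the training samples, log-probabilities are used to avoid numerical underflow, and the function names and the var_floor parameter are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the prior p(C_k) and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # var_floor (assumed) keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(cols, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posteriors(model, x):
    """log p(C_k) + sum_i log p(x_i | C_k), i.e. equation (8) in log form."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xi, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        scores[label] = score
    return scores

def predict(model, x):
    scores = log_posteriors(model, x)
    return max(scores, key=scores.get)  # label with the highest posterior

# Toy two-feature data; the values are made up for illustration.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 1.9]), predict(model, [3.1, 4.1]))
```

The denominator p(x) of equation (7) never appears in the code, since it is the same for every class and does not change which label has the highest score.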
B. Logistic Regression
The name logistic regression comes from the logistic
function, or sigmoid function, used as the activation
function. The sigmoid function has a range of 0 to 1, so it is
widely used in models that require a probability estimate as
output.
Logistic regression is a statistical model that, in its basic
form, uses a logistic function to model a binary dependent
variable. In regression, the parameters that yield the most
accurate probabilities are estimated.
Let X be an n×d matrix, where n is the number of samples and
d is the number of features (independent attributes), and let
y be a vector of binary outcomes. y is an n×1 matrix whose
entries are the labels for the corresponding 1×d rows of X.

A linear model describing this problem for a single sample
x (one row of X) would be of the form
$$z = w^T x + b \qquad (9)$$
$$\hat{y} = a = \sigma(z) \qquad (10)$$
$$L(a, y) = -\left[\, y \log a + (1 - y) \log(1 - a) \,\right] \qquad (11)$$
where a is the sigmoid of z and represents the probability
of the positive class given a sample from X, and y is the
ground truth (0 or 1). L is the loss function, which measures
the discrepancy between y and a. The objective of the
regression is to estimate the parameters w and b that minimise
the loss function, which can be done using gradient descent.
In gradient descent, we repeatedly update the parameters w
and b by subtracting dw and db, scaled by the learning rate,
until the optimal parameters are reached. Here dw is the
derivative of the loss function with respect to the parameter
w and db is the derivative of the loss function with respect
to the parameter b:
dz = a - y (12)
dw = x*dz (13)
db = dz (14)
w = w - α*dw (15)
b = b - α*db (16)
where α is the learning rate of the algorithm.
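The update rules in equations (12) to (16) can be sketched as a short training loop. This is a minimal illustration, not the solver used in the experiments: it assumes batch gradient descent in which the per-sample gradients are averaged over the training set, and the function names, learning rate and number of epochs are hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, alpha=0.5, epochs=2000):
    """Batch gradient descent using dz = a - y, dw = x*dz, db = dz."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        dw, db = [0.0] * d, 0.0
        for x, target in zip(X, y):
            a = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            dz = a - target                      # equation (12)
            for j in range(d):
                dw[j] += x[j] * dz / n           # equation (13), averaged (assumption)
            db += dz / n                         # equation (14), averaged (assumption)
        w = [wj - alpha * dwj for wj, dwj in zip(w, dw)]  # equation (15)
        b = b - alpha * db                                # equation (16)
    return w, b

# Toy one-feature data: class 1 for the larger values.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
for x in X:
    print(x, round(sigmoid(w[0] * x[0] + b), 3))
```

After training, the predicted probability crosses 0.5 between the two classes, which is the decision boundary of the model.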
C. Decision Tree Classifier
The decision tree algorithm belongs to the class of supervised
machine learning algorithms. The goal of the classifier is to
build an optimal decision tree from the given set of features
and labels so that it can predict the label of a new set of
features by traversing the decision tree.
A decision tree consists of a root node (which is the best
predictor), a set of inner nodes and leaf nodes. Leaf nodes
correspond to the different classes of the dataset, whereas
the root node and the inner nodes correspond to the features
extracted from the dataset.
The performance of the classifier depends on how well the
tree is constructed from the training data. Building a
decision tree is a recursive process: it begins at the root
node and keeps splitting the dataset into smaller subsets. The
feature that best predicts a particular sub-dataset takes the
place of the corresponding inner node in the tree.
A common metric for deciding which feature is the best
predictor of a sub-dataset is the Gini impurity of that
sub-dataset. Gini impurity measures how often a randomly
chosen element of the dataset would be misclassified if it
were labeled randomly according to the distribution of classes
in the subset.
The Gini impurity can be calculated by summing, over all
classes, the probability p_i of class i being chosen times the
probability 1 − p_i of misclassifying that item.
For a sub-dataset with J classes, the Gini impurity is
$$G(p) = \sum_{i=1}^{J} p_i \sum_{k \neq i} p_k = \sum_{i=1}^{J} p_i (1 - p_i) = \sum_{i=1}^{J} p_i - \sum_{i=1}^{J} p_i^2$$
Hence,
$$G(p) = 1 - \sum_{i=1}^{J} p_i^2 \qquad (17)$$
where p_i can be estimated in each sub-dataset as the fraction
of its samples that belong to class i.
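Equation (17) translates directly into a few lines of Python. The sketch below is illustrative: the function names are hypothetical, p_i is estimated as the fraction of labels in class i, and the size-weighted impurity of a two-way split is included only as an example of how candidate splits are commonly compared.

```python
from collections import Counter

def gini_impurity(labels):
    """G = 1 - sum_i p_i^2, with p_i the fraction of samples in class i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a split into two non-empty subsets."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

mixed = ["covid", "covid", "normal", "normal"]
print(gini_impurity(mixed))                                    # evenly mixed node
print(split_impurity(["covid", "covid"], ["normal", "normal"]))  # perfect split
```

A node containing a single class has impurity 0, so a split that separates the classes completely is preferred over one that leaves them mixed.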
D. Support Vector Machine
The Support Vector Machine (SVM) is a machine learning
classifier that takes multi-dimensional data vectors and the
classes (labels) they belong to and establishes a boundary,
called the decision boundary, between the classes, so that new
data can be classified simply by checking which side of the
boundary it falls on. A maximum margin classifier places this
boundary as far as possible from the nearest training points
of each class.
However, depending on the parameters, a maximum margin
classifier may not always lead to an optimal decision
boundary: if there are outliers on either side, the boundary
may end up very close to some data points. Hence it is
sometimes important to allow misclassifications in order to
find the optimal boundary. A classifier that allows some
misclassification to find the best boundary with maximum
margin is called a soft margin classifier or a support vector
classifier.
Mathematically, the aim of the support vector machine is to
minimize
$$\frac{1}{2}\|w\|^2$$
for the model of eq. (18), subject to the constraint of eq. (19):
$$Y = W^T X + B \qquad (18)$$
$$|Y_i - \langle w, x_i \rangle - b| \le \varepsilon \qquad (19)$$
Again, a linear support vector classifier may not be optimal
for a dataset with complex features. Different kernel
functions therefore exist with which the maximal margin
hyperplane can be found. Some of the more common kernels are
linear, polynomial and RBF kernels. Kernels such as the
polynomial kernel work in higher dimensions to find the best
support vector classifier, while radial basis function (RBF)
kernels, also known as Gaussian kernels, are functions of the
distance from a data point (r = ||x − xi||). The RBF kernel
between two data points x and x′ is defined by
$$K(x, x') = e^{-\gamma \|x - x'\|^2} \qquad (20)$$
where ||x − x′||² is the squared Euclidean distance between
the two points, γ is a specified parameter, and K(x, x′) is
the resulting kernel value.
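Equation (20) can be evaluated with a few lines of Python. This is a minimal sketch: the function name rbf_kernel is hypothetical, the default gamma of 0.001 is simply the value reported later for the tuned SVM, and the sample points are made up.

```python
import math

def rbf_kernel(x, x_prime, gamma=0.001):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)

p = [0.0, 0.0]
q = [3.0, 4.0]                        # squared distance from p is 25
print(rbf_kernel(p, p))               # identical points give the maximum value
print(rbf_kernel(p, q))               # small gamma: the value decays slowly
print(rbf_kernel(p, q, gamma=1.0))    # large gamma: the value decays quickly
```

The kernel value lies between 0 and 1 and shrinks as the points move apart; a smaller γ makes it decay more slowly with distance, which gives a smoother decision boundary.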

IV. CONVOLUTIONAL NEURAL NETWORKS
A Convolutional Neural Network (CNN) is a deep learning neural
network used to analyze visual imagery. It consists of several
layers in the following order: an input layer, hidden layers
(convolution layers, pooling layers and fully connected
layers) and an output layer. A ConvNet learns features by
applying appropriate kernel filters. As the number of
parameters is reduced and the weights are updated, the network
is able to generalise very well on the image dataset. Its role
is to reduce the images to a form that is easier to process,
without compromising the features that are essential for an
accurate prediction.
The convolution operation is a mathematical operation
applied to the input images to capture features such as
edges. A pooling layer almost always follows a convolutional
layer and is used to reduce the spatial size of the matrix.
This dimensionality reduction lowers the computational power
necessary for model training.
After applying the above techniques, we have a convolved
matrix that encodes several features of the input images. We
then flatten the matrix and employ a neural network for
classification. The flattened matrix contains non-linear
combinations of the features; to learn these combinations and
make accurate predictions, we use a fully connected layer,
which in this case is a multi-layer perceptron.
Backpropagation is applied at every training iteration. After
a number of epochs, the model classifies each image into one
of two classes using a sigmoid output.
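The convolution and pooling operations described above can be illustrated on a tiny image. The sketch below is an example only and does not reproduce the network used in this work: it assumes a single hand-written edge filter instead of learned kernels, no padding, a stride of 1 for the convolution and non-overlapping 2×2 max pooling. As is usual in CNN libraries, the kernel is applied without flipping (cross-correlation). The function names are hypothetical.

```python
def conv2d_valid(img, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [[max(fmap[i + u][j + v] for u in range(size) for v in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# Toy 4x4 image with a vertical edge between the dark and bright halves.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1]]               # responds to a left-to-right increase
feature_map = conv2d_valid(image, edge_kernel)
pooled = max_pool2d(feature_map)
print(feature_map)
print(pooled)
```

The feature map is non-zero only where the edge is, and pooling keeps that response while reducing the size of the matrix passed on to the next layer.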
V. CLASSIFICATION & COMPARISON
A. Dataset
We used the publicly available CovidX dataset from the
Covid-Net Open Source Initiative by Linda Wang and Alexander
Wong of the Department of Systems Design Engineering,
University of Waterloo, Canada. It is a standard, labelled
dataset containing 14904 Non-Covid images and 594 Covid
images.
Fig 3. Covid -ve CXR images
The images were read and converted to an integer representation
using the cv2 module, and the resulting values were scaled
uniformly to avoid zero values that could lead to division by
zero. The images were then transformed into a 7-element
feature vector using DWT and entropy.
Fig 4. Covid +ve CXR images
B. Result and Analysis
Choosing appropriate parameters is essential to arriving at
the best classification model, so we used hyper-parameter
tuning techniques to validate our models at different
parameter values. The Naive Bayes classifier turned out to be
insensitive to its major parameters, such as the prior
probabilities. Logistic regression performed better with the
penalty set to ‘l2’, which uses the ridge method, and the
solver set to ‘newton-cg’, which uses second-order derivatives
for the optimization. The DTC performed best with the
criterion parameter set to ‘entropy’ rather than ‘gini’. The
SVM performed best with the parameter ‘C’ set to 63; this
parameter is inversely proportional to the amount of
misclassification allowed. The SVM kernel was set to ‘RBF’, as
expected, which allows classification to work in an
infinite-dimensional space, and gamma, which defines the
curvature of the RBF kernel, was set to 0.001, giving less
curvature.
We compared the features obtained from the DWT+entropy
technique using the Decision Tree classifier, Logistic
Regression classifier, Naive Bayes classifier and Support
Vector Machine; the classification parameters were obtained
by hyper-parameter tuning. Alternatively, we fed the images
directly, without any other feature extraction, to a
Convolutional Neural Network based classifier.
TABLE I. CLASSIFICATION COMPARISON
Feature            Precision Score   Recall Score   F1-Score   Accuracy
CNN                NA                NA             NA         83.44%
DWT+ENTROPY+SVM    0.9162            0.867          0.8854     99.13%
DWT+DTC            0.855             0.8538         0.8494     98.85%
DWT+LRC            0.909             0.4298         0.5556     97.59%
DWT+NBC            0.907             0.7932         0.8414     98.86%

The F1-score was chosen as the classification metric because
we were dealing with an imbalanced dataset.
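The effect of class imbalance on accuracy can be checked with a small sketch. The functions below are illustrative and their names are hypothetical; the only figures taken from this work are the class counts of the dataset (14904 Non-Covid and 594 Covid images), while the degenerate classifier that labels every image as Non-Covid is a thought experiment.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive (Covid) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# A classifier that labels every image as Non-Covid (thought experiment).
tp, fp, fn, tn = 0, 0, 594, 14904
print("accuracy:", round(accuracy(tp, tn, fp, fn), 4))
print("precision, recall, F1:", precision_recall_f1(tp, fp, fn))
```

Such a classifier reaches an accuracy of about 0.96 without detecting a single Covid case, whereas its F1-score is 0, which is why accuracy alone is not a reliable metric here.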
As is evident from the scores in Table I, the support
vector machine classifier did the best job at classification,
with a mean F1-score of 0.8854. The support vector machine
came ahead in all other classification metrics as well.
Logistic regression had the lowest F1-score, 0.5556, which
is only marginally better than random prediction. This
underwhelming performance can be attributed to the linearly
inseparable nature of the feature set, which logistic
regression cannot classify.
VI. CONCLUSION
In this paper, we compared ML classification algorithms
for accurately predicting COVID-19 using the feature set
extracted with wavelet entropy.
Although the entropy values and other hyper-parameters
used in the classification are difficult to interpret, the
proposed method using SVM gives good classification results.
The classification metrics can be improved by training with
more images and by more robust hyperparameter tuning;
alternatively, techniques other than entropy could be used for
dimensionality reduction.
The model can be further extended to accommodate more
diseases that can be diagnosed using CXR images, so in future
it could be developed into a multi-disease classification
model.
ACKNOWLEDGMENT
The work was done under the guidance of Dr. Shihabudeen
K.V, Assistant professor at National Institute of Technology,
Calicut.
