II. CLASSICAL CONVOLUTION NEURAL NETWORK (CNN) MODEL
A. Convolution Operation
The convolution operation between an image $X \in \Re^{(u \times u)}$ and a filter $F \in \Re^{(v \times v)}$ is defined as follows

$$X \sim F = C_{\left(\left(\frac{u - v + 2Pad}{s} + 1\right) \times \left(\frac{u - v + 2Pad}{s} + 1\right)\right)} \tag{1}$$

where

$$C[a, b] = \sum_{k=0}^{u} \sum_{l=0}^{u} X[k, l]\, F[a - k, b - l] \tag{2}$$

Here, ~ denotes the convolution operation. The stride s is the number of pixels by which F slides over X at each step. The padding Pad is the width of the border of zeros added on each side of X. The pairs (a, b) and (k, l) are the row and column indices of C and X, respectively.
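As an illustration of (1) and (2), the following NumPy sketch computes the output size and the convolution by sliding the flipped filter over the zero-padded image. The function names `conv_output_size` and `conv2d` are hypothetical, and the sliding-window reading of the index ranges in (2) is an assumption made for this example.

```python
import numpy as np

def conv_output_size(u: int, v: int, pad: int = 0, s: int = 1) -> int:
    """Side length of C from Eq. (1): (u - v + 2*pad) / s + 1 (integer division)."""
    return (u - v + 2 * pad) // s + 1

def conv2d(X: np.ndarray, F: np.ndarray, pad: int = 0, s: int = 1) -> np.ndarray:
    """Direct convolution of a square image X with a square filter F."""
    u, v = X.shape[0], F.shape[0]
    Xp = np.pad(X, pad)          # border of zeros of width `pad` on every side
    Ff = F[::-1, ::-1]           # flip the filter: convolution, not cross-correlation
    w = conv_output_size(u, v, pad, s)
    C = np.zeros((w, w))
    for a in range(w):
        for b in range(w):
            patch = Xp[a * s:a * s + v, b * s:b * s + v]
            C[a, b] = np.sum(patch * Ff)
    return C

# Small example: a 4x4 image and a 2x2 filter, Pad = 0 and s = 1.
X = np.arange(16, dtype=float).reshape(4, 4)
F = np.array([[1.0, 0.0], [0.0, -1.0]])
C = conv2d(X, F)                 # shape (3, 3)
```

With Pad = 0 and s = 1 the output side is u − v + 1, which is the case used in the rest of the paper.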
B. Convolution Layer
The convolution layer shown in Fig. 1 consists of a convolution map followed by a pooling map. Here, the convolution operation uses padding Pad = 0 and stride s = 1, so by (1) each convolution output has size ((u − v + 1) × (u − v + 1)).
Fig. 1. Convolution Layer (ConvL).
Convolution Map
The convolution map is designed as follows:
• An input image $X \in \Re^{(u \times u)}$.
• r filters $F_j \in \Re^{(v \times v)}$; j = 1, 2, · · · , r.
• A bias matrix $B_{fj} \in \Re^{((u-v+1) \times (u-v+1))}$.
• A non-linear activation function f.
• An output matrix $Cp_j \in \Re^{((u-v+1) \times (u-v+1))}$.
The output convolution map is defined as follows

$$C_j = X \sim F_j + B_{fj} \tag{3}$$

$$Cp_j = f(C_j) \tag{4}$$
Pooling Map
A pooling map is an essential unit of the CNN architecture. It reduces the computational complexity of the network by reducing the dimension of $Cp_j$. Average pooling and max pooling are examples of the pooling operation [18], [19]. Typically, a kernel $K_j$ of size (2 × 2) with a stride of 2 is applied to compute the average or the maximum value of each patch of $Cp_j$. The pooling output $P_j$ has size $\left(\frac{u-v+1}{2} \times \frac{u-v+1}{2}\right)$.
Appendix 1 presents the average and max-pooling operations.
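The (2 × 2), stride-2 pooling described above can be sketched in NumPy as follows. The function name `pool2x2` and the example matrix are hypothetical, and the sketch assumes that the side length of $Cp_j$ is even.

```python
import numpy as np

def pool2x2(Cp: np.ndarray, mode: str = "average") -> np.ndarray:
    """Apply a 2x2 kernel with stride 2 to a square matrix of even side length."""
    w = Cp.shape[0]
    if w % 2 != 0:
        raise ValueError("side length must be even for 2x2, stride-2 pooling")
    # blocks[i, di, j, dj] == Cp[2*i + di, 2*j + dj]
    blocks = Cp.reshape(w // 2, 2, w // 2, 2)
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    return blocks.max(axis=(1, 3))

Cp = np.array([[1.0, 2.0, 5.0, 6.0],
               [3.0, 4.0, 7.0, 8.0],
               [0.0, 0.0, 1.0, 1.0],
               [0.0, 4.0, 1.0, 1.0]])
P_avg = pool2x2(Cp, "average")   # [[2.5, 6.5], [1.0, 1.0]]
P_max = pool2x2(Cp, "max")       # [[4.0, 8.0], [4.0, 1.0]]
```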
C. Fully Connected Layer
The convolution filters detect features of the input images, called local features. A fully connected layer is added in series with the convolution layer to recognize the input images [12], [15], [26]. As shown in Fig. 2, a fully connected layer has an input vector $Y^0$ corresponding to the concatenation of the r pooling maps $P_j$. The output vector $Y^t$ defines the image classes. Typically, a series of t fully connected hidden layers is added between the input and output vectors to enhance the CNN performance.
Fig. 2. Fully Connected layer (FCL).
The basic equations of the fully connected hidden layers are

$$H^i = W^i\, Y^{(i-1)} + B^i \tag{5}$$

$$Y^i = f(H^i) \tag{6}$$

where $H^i$ is the weighted sum vector, $B^i$ is the bias vector, and $W^i$ is the weight matrix that represents the interconnection between hidden layers. $Y^i$ denotes the output vector of the i-th fully connected hidden layer.
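Equations (5) and (6) amount to one matrix-vector product followed by an element-wise activation. A minimal sketch, assuming ReLU for f (the activation used in the simulations) and with hypothetical names and numbers:

```python
import numpy as np

def relu(h: np.ndarray) -> np.ndarray:
    return np.maximum(h, 0.0)

def fc_layer(W: np.ndarray, Y_prev: np.ndarray, B: np.ndarray, f=relu) -> np.ndarray:
    """One fully connected layer: H = W Y_prev + B (Eq. 5), Y = f(H) (Eq. 6)."""
    H = W @ Y_prev + B
    return f(H)

# Toy layer with 2 inputs and 3 neurons.
W = np.array([[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]])
Y0 = np.array([[2.0], [1.0]])
B = np.array([[0.0], [1.0], [0.0]])
Y1 = fc_layer(W, Y0, B)          # H = [1, 2.5, -3]^T, so Y1 = [1, 2.5, 0]^T
```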
D. Convolution Neural Networks Model
The convolution neural network shown in Fig. 3 is composed of one convolution layer ConvL in series with t fully connected hidden layers L. X is the input image to be recognized by the CNN model. $Y^t$ is the CNN output corresponding to the image recognition.
Authorized licensed use limited to: Tsinghua University. Downloaded on December 19,2020 at 08:24:47 UTC from IEEE Xplore. Restrictions apply.
Fig. 3. Convolution Neural Network Architecture.
III. VECTOR-BASED CNN MODEL
This section aims to replace the classical convolution operation by a matrix operation.
Definition: We define the vector expression of any matrix $M \in \Re^{(n \times n)}$ as follows

$$M = \begin{bmatrix} M^{1T} \\ M^{2T} \\ \vdots \\ M^{nT} \end{bmatrix}, \qquad \bar{M}_{(n^2 \times 1)} = \begin{bmatrix} M^{1} \\ M^{2} \\ \vdots \\ M^{n} \end{bmatrix}$$

In this section, the output convolution map $Cp_j$ of size ((u − v + 1) × (u − v + 1)) is transformed into a vector $\bar{C}p_j$ of dimension $((u - v + 1)^2 \times 1)$.
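Based on Appendix 2, the output convolved vector $\bar{C}p_j$ and the output average pooling vector $\bar{P}_j$ are defined as follows

$$\bar{C}_j = X_x \cdot \bar{F}_j + \bar{B}_{fj} \tag{7}$$

$$\bar{C}p_j = f(\bar{C}_j) \tag{8}$$

$$\bar{P}_j = \begin{bmatrix} \dfrac{\bar{c}p_{j1} + \bar{c}p_{j2} + \bar{c}p_{j3} + \bar{c}p_{j4}}{4} \\[2mm] \dfrac{\bar{c}p_{j5} + \bar{c}p_{j6} + \bar{c}p_{j7} + \bar{c}p_{j8}}{4} \\ \vdots \\ \dfrac{\bar{c}p_{j(w^2-3)} + \bar{c}p_{j(w^2-2)} + \bar{c}p_{j(w^2-1)} + \bar{c}p_{jw^2}}{4} \end{bmatrix} \tag{9}$$

Note that $X_x$ is the $((u - v + 1)^2 \times v^2)$ input image matrix, $\bar{F}_j$ is the $(v^2 \times 1)$ filter vector and $\bar{B}_{fj}$ is the $((u - v + 1)^2 \times 1)$ convolution bias vector. In (9), w = u − v + 1.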
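Equations (7) to (9) can be sketched in NumPy as below. The helper names `patch_matrix` and `avg_pool_vector` are hypothetical. Two assumptions are made for the illustration: the rows of $X_x$ list the (v × v) windows of X in row-major order without flipping the filter, which may differ from the row ordering given in Appendix 2, and (9) is taken literally, averaging consecutive groups of four entries. ReLU is assumed for f.

```python
import numpy as np

def patch_matrix(X: np.ndarray, v: int) -> np.ndarray:
    """Stack every v-by-v window of X as one row: shape ((u-v+1)^2, v^2)."""
    u = X.shape[0]
    w = u - v + 1
    rows = [X[a:a + v, b:b + v].reshape(-1) for a in range(w) for b in range(w)]
    return np.array(rows)

def avg_pool_vector(cp: np.ndarray) -> np.ndarray:
    """Eq. (9) read literally: average each consecutive group of four entries."""
    return cp.reshape(-1, 4).mean(axis=1, keepdims=True)

X = np.arange(25, dtype=float).reshape(5, 5)    # u = 5
F = np.array([[-1.0, 0.0], [0.0, 1.0]])         # v = 2
Xx = patch_matrix(X, 2)                         # (16, 4)
F_bar = F.reshape(-1, 1)                        # (4, 1)
Bf_bar = np.zeros((16, 1))

C_bar = Xx @ F_bar + Bf_bar                     # Eq. (7)
Cp_bar = np.maximum(C_bar, 0.0)                 # Eq. (8) with ReLU
P_bar = avg_pool_vector(Cp_bar)                 # Eq. (9), shape (4, 1)
```

Whether each group of four entries in (9) corresponds to a (2 × 2) spatial patch depends on the row ordering chosen for $X_x$.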
A. Forward Propagation
The CNN model developed in this study consists of one convolution layer in series with three fully connected hidden layers (t = 3).
Convolution Layer
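Based on equations (7), (8), and (9), the output convolution map and the average pooling map of a model consisting of r filter vectors can be written as follows

$$\bar{C}_{((r \times w^2) \times 1)} = X_{o\,((r \times w^2) \times (r \times v^2))} \cdot \bar{F}_{((r \times v^2) \times 1)} + \bar{B}_{f\,((r \times w^2) \times 1)} \tag{10}$$

$$\bar{C}p = f(\bar{C}) \tag{11}$$

$$\bar{P}_{((r \times \frac{w^2}{4}) \times 1)} = \begin{bmatrix} \bar{P}_1 \\ \bar{P}_2 \\ \vdots \\ \bar{P}_r \end{bmatrix} \tag{12}$$

where w = u − v + 1 and

$$X_{o\,((r \times w^2) \times (r \times v^2))} = \begin{bmatrix} X_x & 0 & \cdots & 0 \\ 0 & X_x & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_x \end{bmatrix}$$

$$\bar{F}_{(r \times v^2)} = \begin{bmatrix} \bar{F}_1 \\ \bar{F}_2 \\ \vdots \\ \bar{F}_r \end{bmatrix}, \qquad \bar{B}_{f\,(r \times w^2)} = \begin{bmatrix} \bar{B}_{f1} \\ \bar{B}_{f2} \\ \vdots \\ \bar{B}_{fr} \end{bmatrix}$$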
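The block-diagonal structure of $X_o$ in (10) means that all r filters are applied to the same image matrix $X_x$ in a single product. A small sketch with random stand-in data (the sizes and variable names are illustrative assumptions) checks that (10) reproduces (7) applied filter by filter:

```python
import numpy as np

r, w2, v2 = 3, 4, 9              # illustrative sizes: r filters, w^2 = 4, v^2 = 9
rng = np.random.default_rng(0)
Xx = rng.standard_normal((w2, v2))
filters = [rng.standard_normal((v2, 1)) for _ in range(r)]
biases = [rng.standard_normal((w2, 1)) for _ in range(r)]

Xo = np.kron(np.eye(r), Xx)      # block-diagonal matrix, shape (r*w2, r*v2)
F_bar = np.vstack(filters)       # (r*v2, 1)
Bf_bar = np.vstack(biases)       # (r*w2, 1)
C_bar = Xo @ F_bar + Bf_bar      # Eq. (10)

# Same result obtained by applying Eq. (7) to each filter and stacking.
C_per_filter = np.vstack([Xx @ F + B for F, B in zip(filters, biases)])
```

Since most entries of $X_o$ are zero, a practical implementation might store only $X_x$ and loop or batch over the filters; the dense form is used here only to mirror (10).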
Concatenation
Concatenation is the operation that defines the input of the fully connected layer as a function of the r pooling operations.

$$Y^0_{(m \times 1)} = \bar{P} \tag{13}$$

where $m = r \times \frac{w^2}{4}$.
Fully Connected Layer
The equations of the fully connected hidden layers are derived from [14].
Layer 1

$$Y^1_{(n \times 1)} = f\left(W^1_{(n \times m)}\, Y^0_{(m \times 1)} + B^1_{(n \times 1)}\right) \tag{14}$$

n denotes the number of artificial neurons in the first fully connected hidden layer.
Layer 2

$$Y^2_{(o \times 1)} = f\left(W^2_{(o \times n)}\, Y^1_{(n \times 1)} + B^2_{(o \times 1)}\right) \tag{15}$$

o denotes the number of artificial neurons in the second fully connected hidden layer.
Layer 3

$$Y^3_{(p \times 1)} = f\left(W^3_{(p \times o)}\, Y^2_{(o \times 1)} + B^3_{(p \times 1)}\right) \tag{16}$$

p is the dimension of the label vector of the input data.
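The dimension chain of (13) to (16) can be checked with a short sketch. The values u = 28, v = 9 and r = 5 match one configuration of the simulations; the layer widths n = o = 200 and p = 10, the random weights and the ReLU activation are assumptions made only for this illustration.

```python
import numpy as np

def relu(h: np.ndarray) -> np.ndarray:
    return np.maximum(h, 0.0)

u, v, r = 28, 9, 5
w = u - v + 1                    # 20
m = r * w * w // 4               # 500, the length of Y^0 in Eq. (13)
n, o, p = 200, 200, 10           # assumed layer widths

rng = np.random.default_rng(0)
shapes = [(n, m), (o, n), (p, o)]                     # W^1, W^2, W^3
Ws = [0.01 * rng.standard_normal(s) for s in shapes]
Bs = [np.zeros((s[0], 1)) for s in shapes]

Y = rng.random((m, 1))           # stand-in for the pooled vector P_bar
for W, B in zip(Ws, Bs):         # Eqs. (14), (15), (16) in turn
    Y = relu(W @ Y + B)
Y3 = Y                           # shape (p, 1)
```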
B. Back Propagation
To update the CNN parameters and perform the learning process, a back propagation algorithm is developed to minimize a cost function E. In our analysis, the mean squared error cost function [12], [15] is used.

$$E = \frac{1}{2}\,(Y^3 - Y_d)^T (Y^3 - Y_d) \tag{17}$$

where $Y_d$ is the desired (labeled) output vector. Equation (18) shows the gradient descent method used to update the CNN parameters.

$$\ell_{new} = \ell_{old} - \alpha \left(\frac{\partial E}{\partial \ell_{old}}\right)^T \tag{18}$$
Here, ℓ stands for any of the parameters to be updated: the convolution bias vector $\bar{B}_f$, the filter vector $\bar{F}$, the weight matrix W, and the fully connected bias vector B. α is the learning rate; it can be chosen as a constant or as a variable with a positive value.
We note that the update equations of the CNN parameters involve the computation of $\frac{\partial E}{\partial \ell_{old}}$. We develop below the parameter updates for each layer.
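A minimal sketch of the cost (17) and of one update step (18) for a vector parameter is given below. The function names and the numbers are hypothetical; the gradient is assumed to be stored as a row vector, so that its transpose has the shape of the column parameter.

```python
import numpy as np

def mse_cost(Y3: np.ndarray, Yd: np.ndarray) -> float:
    """Eq. (17): E = 1/2 (Y3 - Yd)^T (Y3 - Yd)."""
    d = Y3 - Yd
    return 0.5 * float(np.sum(d * d))

def gd_step(param: np.ndarray, grad_row: np.ndarray, alpha: float) -> np.ndarray:
    """Eq. (18) for a column-vector parameter and a row-vector gradient."""
    return param - alpha * grad_row.T

Y3 = np.array([[1.0], [0.0]])
Yd = np.array([[0.0], [0.0]])
E = mse_cost(Y3, Yd)                     # 0.5

B_old = np.array([[1.0], [2.0]])
dE_dB = np.array([[0.5, -1.0]])          # illustrative row-vector gradient
B_new = gd_step(B_old, dE_dB, alpha=0.3) # [[0.85], [2.3]]
```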
Fully Connected Layer
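For the fully connected layer, the detailed development of the back propagation is proposed in [14], where

$$\frac{\partial E}{\partial B^i} = \left[\left(W^{(i+1)T} \cdot \left(\frac{\partial E}{\partial B^{(i+1)}}\right)^T\right) * f'(H^i)\right]^T \tag{19}$$

$$\frac{\partial E}{\partial W^i} = \left(\frac{\partial E}{\partial B^i}\right)^T \cdot Y^{(i-1)T} \tag{20}$$

In (19), $\frac{\partial E}{\partial B^i}$ is a row vector, and the recursion runs from layer i + 1 back to layer i so that the dimensions agree with (14)-(16). The operation ∗ is defined in Appendix 3.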
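The recursion for the fully connected layers can be checked numerically. The sketch below is the standard chain-rule back propagation, not a transcription of [14]: the layer indexing (each layer receives its error term from the layer after it) is an assumption chosen so that the shapes of (14)-(16) agree, tanh is used in place of f only because it is smooth, and all names and sizes are hypothetical. One analytic gradient entry is compared with a central finite difference.

```python
import numpy as np

def f(h):
    return np.tanh(h)

def f_prime(h):
    return 1.0 - np.tanh(h) ** 2

def forward(Ws, Bs, Y0):
    """Return the outputs [Y^0, ..., Y^t] and the weighted sums [H^1, ..., H^t]."""
    Ys, Hs = [Y0], []
    for W, B in zip(Ws, Bs):
        Hs.append(W @ Ys[-1] + B)
        Ys.append(f(Hs[-1]))
    return Ys, Hs

def cost(Ws, Bs, Y0, Yd):
    d = forward(Ws, Bs, Y0)[0][-1] - Yd
    return 0.5 * float(np.sum(d * d))

def backward(Ws, Ys, Hs, Yd):
    """Row-vector gradients dE/dB^i and matrix gradients dE/dW^i for every layer."""
    t = len(Ws)
    dB, dW = [None] * t, [None] * t
    delta = (Ys[-1] - Yd) * f_prime(Hs[-1])     # output layer, column vector
    for i in reversed(range(t)):
        dB[i] = delta.T                          # row vector
        dW[i] = delta @ Ys[i].T                  # outer product, as in Eq. (20)
        if i > 0:                                # pass the error to the previous layer
            delta = (Ws[i].T @ delta) * f_prime(Hs[i - 1])
    return dB, dW

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                             # m, n, o, p (illustrative)
Ws = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(3)]
Bs = [rng.standard_normal((sizes[k + 1], 1)) for k in range(3)]
Y0 = rng.standard_normal((4, 1))
Yd = rng.standard_normal((2, 1))

Ys, Hs = forward(Ws, Bs, Y0)
dB, dW = backward(Ws, Ys, Hs, Yd)

# Central finite difference on one bias entry of the first layer.
eps = 1e-6
Bp = [B.copy() for B in Bs]
Bm = [B.copy() for B in Bs]
Bp[0][1, 0] += eps
Bm[0][1, 0] -= eps
numeric_dB = (cost(Ws, Bp, Y0, Yd) - cost(Ws, Bm, Y0, Yd)) / (2 * eps)
```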
Convolution Layer
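$$\left.\frac{\partial E_{(1 \times 1)}}{\partial \bar{B}_{f\,((r \times w^2) \times 1)}}\right|_{(1 \times (r \times w^2))} = \frac{\partial E_{(1 \times 1)}}{\partial \bar{C}_{((r \times w^2) \times 1)}}\; \frac{\partial \bar{C}_{((r \times w^2) \times 1)}}{\partial \bar{B}_{f\,((r \times w^2) \times 1)}} \tag{21}$$

$$\left.\frac{\partial E_{(1 \times 1)}}{\partial \bar{F}_{((r \times v^2) \times 1)}}\right|_{(1 \times (r \times v^2))} = \frac{\partial E_{(1 \times 1)}}{\partial \bar{C}_{((r \times w^2) \times 1)}}\; \frac{\partial \bar{C}_{((r \times w^2) \times 1)}}{\partial \bar{F}_{((r \times v^2) \times 1)}} \tag{22}$$

By substituting (10) into (21) and (22), we obtain

$$\left.\frac{\partial \bar{C}_{((r \times w^2) \times 1)}}{\partial \bar{B}_{f\,((r \times w^2) \times 1)}}\right|_{((r \times w^2) \times (r \times w^2))} = I \tag{23}$$

$$\left.\frac{\partial \bar{C}_{((r \times w^2) \times 1)}}{\partial \bar{F}_{((r \times v^2) \times 1)}}\right|_{((r \times w^2) \times (r \times v^2))} = X_o \tag{24}$$

Here, I is the identity matrix.
Computation of $\frac{\partial E}{\partial \bar{C}}$

$$\left.\frac{\partial E_{(1 \times 1)}}{\partial \bar{C}_{((r \times w^2) \times 1)}}\right|_{(1 \times (r \times w^2))} = \frac{\partial E_{(1 \times 1)}}{\partial \bar{C}p_{((r \times w^2) \times 1)}}\; \frac{\partial \bar{C}p_{((r \times w^2) \times 1)}}{\partial \bar{C}_{((r \times w^2) \times 1)}} = \left.\frac{\partial E}{\partial \bar{C}p}\right|_{(1 \times (r \times w^2))} * f'(\bar{C}p)^T_{(1 \times (r \times w^2))} \tag{25}$$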
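Definition: Let us define the operation Inc of any vector U as follows:

$$U_{(n \times 1)} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}, \qquad Inc(U) = \begin{bmatrix} \frac{u_1}{4} \\ \frac{u_1}{4} \\ \frac{u_1}{4} \\ \frac{u_1}{4} \\ \vdots \\ \frac{u_n}{4} \\ \frac{u_n}{4} \\ \frac{u_n}{4} \\ \frac{u_n}{4} \end{bmatrix}_{(4n \times 1)} \tag{26}$$

In this section, the Inc operation is used to increase the size of the average pooling vector map, where

$$\frac{\partial E}{\partial \bar{C}p} = Inc\left(\frac{\partial E}{\partial \bar{P}}\right) \tag{27}$$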
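The Inc operation of (26) repeats each entry four times and divides it by 4. The sketch below (the function name `inc` and the matrix `A` are hypothetical) also checks that Inc is the transpose of the linear map that averages consecutive groups of four entries, which is why it appears when the gradient is passed back through the average pooling of (9).

```python
import numpy as np

def inc(U: np.ndarray) -> np.ndarray:
    """Eq. (26): map an (n x 1) vector to a (4n x 1) vector of repeated u_k / 4."""
    return np.repeat(U.reshape(-1), 4).reshape(-1, 1) / 4.0

U = np.array([[4.0], [8.0]])
V = inc(U)                       # [1, 1, 1, 1, 2, 2, 2, 2]^T

# Average pooling of consecutive groups of four, written as a matrix A (n x 4n).
n = U.shape[0]
A = np.kron(np.eye(n), np.full((1, 4), 0.25))
V_from_A = A.T @ U               # equals inc(U)
```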
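Since $Y^0 = \bar{P}$,

$$\frac{\partial E}{\partial \bar{P}} = \frac{\partial E}{\partial B^1}\, W^1 \tag{28}$$

By substituting (28), (27), (25), (24), and (23) into (22) and (21), we obtain

$$\frac{\partial E}{\partial \bar{B}_f} = Inc\left(\frac{\partial E}{\partial B^1}\, W^1\right) * f'(\bar{C}p)^T \tag{29}$$

$$\frac{\partial E}{\partial \bar{F}} = \frac{\partial E}{\partial \bar{B}_f} \cdot X_o \tag{30}$$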
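Equations (28) to (30) can be verified on a toy model with a finite-difference check. The sketch below makes several simplifying assumptions that are not taken from the paper: a single fully connected layer serves as the output layer, tanh replaces f because it is smooth, f′ is evaluated at the pre-activation $\bar{C}$ as the chain rule in (25) requires (for ReLU this gives the same values as evaluating it at $\bar{C}p$), and all sizes and names are hypothetical.

```python
import numpy as np

def f(h):
    return np.tanh(h)

def f_prime(h):
    return 1.0 - np.tanh(h) ** 2

def inc(U):
    """Eq. (26), returned as a row vector."""
    return np.repeat(U.reshape(-1), 4).reshape(1, -1) / 4.0

def forward(F_bar, Bf_bar, Xo, W1, B1):
    C = Xo @ F_bar + Bf_bar                               # Eq. (10)
    P = f(C).reshape(-1, 4).mean(axis=1, keepdims=True)   # Eqs. (11) and (9)
    H1 = W1 @ P + B1
    return C, P, H1

def cost(F_bar, Bf_bar, Xo, W1, B1, Yd):
    d = f(forward(F_bar, Bf_bar, Xo, W1, B1)[2]) - Yd
    return 0.5 * float(np.sum(d * d))

def conv_gradients(F_bar, Bf_bar, Xo, W1, B1, Yd):
    C, P, H1 = forward(F_bar, Bf_bar, Xo, W1, B1)
    dE_dB1 = ((f(H1) - Yd) * f_prime(H1)).T    # output layer, row vector
    dE_dP = dE_dB1 @ W1                        # Eq. (28)
    dE_dBf = inc(dE_dP) * f_prime(C).T         # Eq. (29)
    dE_dF = dE_dBf @ Xo                        # Eq. (30)
    return dE_dBf, dE_dF

rng = np.random.default_rng(1)
r = 2
Xx = rng.standard_normal((4, 4))               # w^2 = 4, v^2 = 4
Xo = np.kron(np.eye(r), Xx)                    # (8, 8)
F_bar = rng.standard_normal((8, 1))
Bf_bar = rng.standard_normal((8, 1))
W1 = rng.standard_normal((3, 2))               # m = r * w^2 / 4 = 2
B1 = rng.standard_normal((3, 1))
Yd = rng.standard_normal((3, 1))

dE_dBf, dE_dF = conv_gradients(F_bar, Bf_bar, Xo, W1, B1, Yd)

# Central finite differences with respect to every filter entry.
eps = 1e-6
numeric_dF = np.zeros_like(F_bar)
for k in range(F_bar.size):
    Fp, Fm = F_bar.copy(), F_bar.copy()
    Fp[k, 0] += eps
    Fm[k, 0] -= eps
    numeric_dF[k, 0] = (cost(Fp, Bf_bar, Xo, W1, B1, Yd)
                        - cost(Fm, Bf_bar, Xo, W1, B1, Yd)) / (2 * eps)
```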
IV. SIMULATION RESULTS AND DISCUSSION
In this section, the MNIST handwritten digits data are used to test the CNN performance using the proposed matrix operation. This database consists of a training set of 60,000 images and a testing set of 10,000 images. The images are grayscale images of dimension (28 × 28). The (10 × 1) output vector classifies the digits from 0 to 9. To enhance the CNN performance, the following hyper-parameters are varied:
• CNN Width.
• CNN Height.
The performance of the CNN model is the ratio between the total number of correct classifications and the number of test images.
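The performance measure defined above can be sketched as follows. The function name `performance` is hypothetical, and taking the predicted digit as the index of the largest entry of the (10 × 1) output vector is an assumption of this example.

```python
import numpy as np

def performance(Y_out: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of correct classifications; Y_out has one (10 x 1) output per column."""
    predicted = np.argmax(Y_out, axis=0)
    return float(np.mean(predicted == labels))

# Four test images: the network picks digits 3, 7, 2, 0; the true digits are 3, 7, 1, 0.
Y_out = np.zeros((10, 4))
Y_out[3, 0] = Y_out[7, 1] = Y_out[2, 2] = Y_out[0, 3] = 1.0
labels = np.array([3, 7, 1, 0])
score = performance(Y_out, labels)   # 0.75
```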
TABLE I
CNN PERFORMANCE ACCORDING TO THE VARIATION OF WIDTH AND HEIGHT HYPER-PARAMETERS
N° of training images   N° of filters   Size of filters   Performance
5,000                   −               −                 0.9325
5,000                   5               (3² × 1)          0.9414
5,000                   5               (9² × 1)          0.9508
5,000                   5               (13² × 1)         0.9478
5,000                   10              (9² × 1)          0.9575
5,000                   20              (9² × 1)          0.9582
10,000                  −               −                 0.9481
10,000                  5               (3² × 1)          0.9570
10,000                  5               (9² × 1)          0.9589
10,000                  5               (13² × 1)         0.9579
10,000                  10              (9² × 1)          0.9623
10,000                  20              (9² × 1)          0.9677
30,000                  −               −                 0.9773
30,000                  5               (3² × 1)          0.9785
30,000                  5               (9² × 1)          0.9810
30,000                  5               (13² × 1)         0.9798
30,000                  10              (9² × 1)          0.9835
30,000                  20              (9² × 1)          0.9871
60,000                  −               −                 0.9790
60,000                  5               (3² × 1)          0.9831
60,000                  5               (9² × 1)          0.9867
60,000                  5               (13² × 1)         0.9859
60,000                  10              (9² × 1)          0.9874
60,000                  20              (9² × 1)          0.9883
The simulated CNN model consists of:
• One convolution layer. The convolution activation function is the ReLU. The pooling operation used is average pooling.
• A fully connected layer. It comprises 3 hidden layers. Each hidden layer consists of 200 artificial neurons. The activation function is the same as the one used in the convolution layer. The learning rate equals 0.3.
As shown in Table I, we tested the CNN performance by varying:
• The number of convolution filter vectors.
• The size of each filter.
• The number of training images.
Here (−) denotes a model consisting of the fully connected layer only.
The peak CNN performance is 0.9883. It is obtained with 20 filters of size (9² × 1) and 60,000 training images. For a fixed filter size, the performance increases with the number of filters and with the number of training images. Among the three filter sizes tested with 5 filters, (9² × 1) gives the best performance for every training set size.
V. CONCLUSION
In this paper, a new matrix operation that substitutes the classical convolution operation is developed. MNIST handwritten digits data are used to test the influence of the CNN hyper-parameters on the model performance. The peak performance achieved is 0.9883. It is obtained using a CNN model composed of one convolution layer and three fully connected hidden layers. The results of the proposed simulation do not represent the optimal CNN hyper-parameter configuration. Further increasing the number of convolution layers and the size of the training dataset may enhance the CNN performance.
APPENDIX
Appendix 1: Average and max-pooling operations
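$$Cp_j = \begin{bmatrix} cp_{11} & cp_{12} & \cdots & cp_{1(u-v+1)} \\ cp_{21} & cp_{22} & \cdots & cp_{2(u-v+1)} \\ \vdots & \vdots & \ddots & \vdots \\ cp_{(u-v+1)1} & cp_{(u-v+1)2} & \cdots & cp_{(u-v+1)(u-v+1)} \end{bmatrix} \tag{31}$$

The pooling map is defined as follows

$$P_{jAve} = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,\frac{(u-v+1)}{2}} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,\frac{(u-v+1)}{2}} \\ \vdots & \vdots & \ddots & \vdots \\ P_{\frac{(u-v+1)}{2},1} & P_{\frac{(u-v+1)}{2},2} & \cdots & P_{\frac{(u-v+1)}{2},\frac{(u-v+1)}{2}} \end{bmatrix} \tag{32}$$

For the average pooling operation,

$$\begin{aligned}
P_{1,1} &= \frac{cp_{11} + cp_{12} + cp_{21} + cp_{22}}{4} \\
P_{1,\frac{(u-v+1)}{2}} &= \frac{cp_{1(u-v)} + cp_{1(u-v+1)} + cp_{2(u-v)} + cp_{2(u-v+1)}}{4} \\
P_{\frac{(u-v+1)}{2},1} &= \frac{cp_{(u-v)1} + cp_{(u-v)2} + cp_{(u-v+1)1} + cp_{(u-v+1)2}}{4} \\
P_{\frac{(u-v+1)}{2},\frac{(u-v+1)}{2}} &= \frac{cp_{(u-v)(u-v)} + cp_{(u-v)(u-v+1)} + cp_{(u-v+1)(u-v)} + cp_{(u-v+1)(u-v+1)}}{4}
\end{aligned}$$

For the max pooling operation,

$$\begin{aligned}
P_{1,1} &= \max(cp_{11},\, cp_{12},\, cp_{21},\, cp_{22}) \\
P_{1,\frac{(u-v+1)}{2}} &= \max(cp_{1(u-v)},\, cp_{1(u-v+1)},\, cp_{2(u-v)},\, cp_{2(u-v+1)}) \\
P_{\frac{(u-v+1)}{2},1} &= \max(cp_{(u-v)1},\, cp_{(u-v)2},\, cp_{(u-v+1)1},\, cp_{(u-v+1)2}) \\
P_{\frac{(u-v+1)}{2},\frac{(u-v+1)}{2}} &= \max(cp_{(u-v)(u-v)},\, cp_{(u-v)(u-v+1)},\, cp_{(u-v+1)(u-v)},\, cp_{(u-v+1)(u-v+1)})
\end{aligned}$$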
Appendix 2: Matrix Operation
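For a CNN model consisting of an input image $X \in \Re^{(u \times u)}$ convolved with a filter $F \in \Re^{(v \times v)}$,

$$X = \begin{bmatrix} X^{1T} \\ X^{2T} \\ \vdots \\ X^{uT} \end{bmatrix}, \qquad F = \begin{bmatrix} F^{1T} \\ F^{2T} \\ \vdots \\ F^{vT} \end{bmatrix}$$

where

$$X^{iT} = \begin{bmatrix} x^i_1 & x^i_2 & \cdots & x^i_u \end{bmatrix}, \qquad F^{iT} = \begin{bmatrix} f^i_1 & f^i_2 & \cdots & f^i_v \end{bmatrix}$$

We define $X^{iT}_{j \to k}$ as follows

$$X^{iT}_{j \to k} = \begin{bmatrix} x^i_j & x^i_{j+1} & \cdots & x^i_k \end{bmatrix}$$

The proposed matrix operation that substitutes the classical convolution operation is:

$$X \sim F = \left. X_x \right|_{((u-v+1)^2 \times v^2)} \cdot \left. \bar{F} \right|_{(v^2 \times 1)} = \begin{bmatrix}
X^{uT}_{1 \to v} & X^{(u-1)T}_{1 \to v} & \cdots & X^{1T}_{1 \to v} \\
X^{uT}_{2 \to (v+1)} & X^{(u-1)T}_{2 \to (v+1)} & \cdots & X^{1T}_{2 \to (v+1)} \\
\vdots & \vdots & \ddots & \vdots \\
X^{uT}_{(u-v+1) \to 1} & X^{(u-1)T}_{(u-v+1) \to 1} & \cdots & X^{1T}_{(u-v+1) \to 1} \\
X^{(u-1)T}_{1 \to v} & X^{(u-2)T}_{1 \to v} & \cdots & X^{2T}_{1 \to v} \\
X^{(u-1)T}_{2 \to (v+1)} & X^{(u-2)T}_{2 \to (v+1)} & \cdots & X^{2T}_{2 \to (v+1)} \\
\vdots & \vdots & \ddots & \vdots \\
X^{(u-1)T}_{(u-v+1) \to 1} & X^{(u-2)T}_{(u-v+1) \to 1} & \cdots & X^{2T}_{(u-v+1) \to 1} \\
\vdots & \vdots & & \vdots \\
X^{vT}_{1 \to v} & X^{(u-v)T}_{1 \to v} & \cdots & X^{(u-v+1)T}_{1 \to v} \\
X^{vT}_{2 \to (v+1)} & X^{(u-v)T}_{2 \to (v+1)} & \cdots & X^{(u-v+1)T}_{2 \to (v+1)} \\
\vdots & \vdots & \ddots & \vdots \\
X^{vT}_{(u-v+1) \to 1} & X^{(u-v)T}_{(u-v+1) \to 1} & \cdots & X^{(u-v+1)T}_{(u-v+1) \to 1}
\end{bmatrix} \begin{bmatrix} F^{1} \\ F^{2} \\ \vdots \\ F^{v} \end{bmatrix} \tag{33}$$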
Appendix 3: The definition of (∗) operation
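Let us define the operation (∗) for any vectors U and V:

$$U_{(n \times 1)} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \quad \text{and} \quad V_{(n \times 1)} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

$$U * V = \begin{bmatrix} u_1 v_1 \\ u_2 v_2 \\ \vdots \\ u_n v_n \end{bmatrix} \tag{34}$$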
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[2] K. Fukushima, “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biol. Cybernetics 36, Springer, pp. 193–202, 1980.
[3] Y. LeCun, et al., “Handwritten digit recognition with a backpropagation network,” in Advances in Neural Information Processing Systems, pp. 396–404, 1990.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
[5] D. Steinkraus, I. Buck, and P. Y. Simard, “Using GPUs for machine learning algorithms,” in Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference, pp. 1115–1120, 2005.
[6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Conference paper at ICLR, pp. 1–14, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2016.
[8] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated Residual Transformations for Deep Neural Networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, 2017.
[9] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” arXiv Prepr. arXiv1602.07261v2, vol. 131, no. 2, pp. 262–263, 2016.
[10] H. Jang, H-Jun. Yang, D-S. Jeong, and H. Lee, “Object classification using CNN for video traffic detection system,” 2015 21st Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), 2015.
[11] H. Yanagisawa, T. Yamashita, and H. Watanabe, “A Study on Object Detection Method from Manga Images using CNN,” 2018 International Workshop on Advanced Image Technology (IWAIT), 2018.
[12] R. B. Arif, M. A. B. Siddique, M. M. R. Khan, and M. R. Oishe, “Study and Observation of the Variations of Accuracies for Handwritten Digits Recognition with Various Hidden Layers and Epochs using Convolutional Neural Network,” 4th International Conference on Electrical Engineering and Information and Communication Technology, 2018.
[13] V. Gullapalli, “A Comparison of Supervised and Reinforcement Learning Methods on a Reinforcement Learning Task,” Proceedings of the 1991 IEEE International Symposium on Intelligent Control, pp. 394–399, 1991.
[14] N. Wagaa and H. Kallel, “Recursive Supervised Artificial Neural Network Algorithm for Data Classification and Regression,” unpublished.
[15] M. A. B. Siddique, M. M. R. Khan, R. B. Arif, and Z. Ashrafi, “Study and Observation of the Variations of Accuracies for Handwritten Digits Recognition with Various Hidden Layers and Epochs using Neural Network Algorithm,” 4th International Conference on Electrical Engineering and Information and Communication Technology, pp. 118–123, 2018.
[16] V. E. Ismailov, “On the approximation by neural networks with bounded number of neurons in hidden layers,” Journal of Mathematical Analysis and Applications, pp. 963–969, 2014.
[17] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” Lecture Notes in Computer Science, Springer, pp. 525–542, 2016.
[18] D. Scherer, A. Müller, and S. Behnke, “Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition,” International Conference on Artificial Neural Networks, Springer, pp. 92–101, 2010.
[19] I. B. Dlimi and H. Kallel, “Robust Neural Control for Robotic Manipulators,” International Journal of Enhanced Research in Science Technology, and Engineering, vol. 5, no. 2, pp. 198–205, 2016.
[20] M. Dong, Y. Li, X. Tang, J. Xu, S. Bi, and A. Y. Cai, “Variable Convolution and Pooling Convolutional Neural Network for Text Sentiment Classification,” IEEE Access, 2020.
[21] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” 13th International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” 2015 IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
[23] P. Dangeti, “Statistics for Machine Learning: Techniques for exploring supervised, unsupervised, and reinforcement learning models with Python and R,” Packt Publishing, 2017.
[24] T. Takase, S. Oyama, and K. Masahito, “Effective neural network training with adaptive learning rate based on training loss,” Neural Networks, pp. 68–78, 2018.
[25] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
[26] I. B. Dlimi and H. Kallel, “Optimal neural control for constrained robotic manipulators,” 2010 5th IEEE International Conference Intelligent Systems, pp. 302–308, 2010.