Convolutional Neural Network with Second-order
Pooling for Underwater Target Classification
Xu Cao, Student Member, IEEE, Roberto Togneri, Senior Member, IEEE, Xiaomin Zhang,
and Yang Yu, Member, IEEE,
Abstract—Underwater target classification using passive sonar
remains challenging due to the variable ocean environment.
Convolutional Neural Networks (CNNs) have shown success in
learning invariant features using local filtering and max pooling.
In this paper, we propose a novel classification framework which
combines the CNN architecture with the second-order pooling
(SOP) to capture the temporal correlations from the time-
frequency (T-F) representation of the radiated acoustic signals.
The convolutional layers are used to learn the local features with a
set of kernel filters from the T-F inputs which are extracted by the
constant-Q transform (CQT). Instead of using max pooling, the
proposed SOP operator is designed to learn the co-occurrences
of different CNN filters using the temporal feature trajectory
of CNN features for each frequency subband. To preserve the
frequency distinctions, the correlated features of each frequency
subband are retained. The pooling results are normalized with
signed square-root and l2 normalization, and then input into
the softmax classifier. The whole network can be trained in
an end-to-end fashion. To explore the generalization ability to
unseen conditions, the proposed CNN model is evaluated on
the real radiated acoustic signals recorded at new sea depths.
The experimental results demonstrate that the proposed method
yields an 8% improvement in classification accuracy over the
state-of-the-art deep learning methods.
Index Terms—underwater target classification, convolutional
neural networks, second-order pooling, constant-Q transform.
I. INTRODUCTION
UNDERWATER target classification aims to detect and
recognize marine vessels from the radiated acoustic
signals recorded by passive sonar. It has many important
applications in ocean engineering, such as automatic target
recognition (ATR) and marine monitoring. The task can be
formulated as a feature representation problem where the
discriminative characteristics are learned from the received
acoustic signals for classification. However, when applied
in practical situations, robustness and generalization ability
to environmental variation are significant for passive sonar
target classification, especially from single-sensor recordings.
Several factors affect the performance of classification
systems, including the lack of a priori knowledge of the targets,
the various working conditions within the same class (such as
speed and power configuration), and the unpredictable ocean
background noise. Consequently, more adaptive and robust
classification models are needed to deal with this problem.
X. Cao, X. Zhang and Y. Yu are with the School of Marine
Science and Technology, Northwestern Polytechnical University, Xi'an
710072, China (e-mail: caoxu@mail.nwpu.edu.cn; xmzhang@nwpu.edu.cn;
nwpuyuy@nwpu.edu.cn).
R. Togneri is with the School of Electrical, Electronics and Computer
Engineering, The University of Western Australia, Perth, WA 6009, Australia
(e-mail: roberto.togneri@uwa.edu.au).
Manuscript received August 15, 2018; revised November 5, 2018.
Several pattern recognition methods have been developed
for underwater target classification with different features
extracted from the radiated acoustic signals. In [1], the features
generated from the wavelet packet transform (WPT) and the
linear predictive coding (LPC) are put into the neural network
(NN) classifier. In [2], a Hidden Markov Model (HMM) is
used for multiaspect target detection and identification. In
[3], a preprocessing method is developed to improve the
performance of a feedforward neural network (NN) for passive
sonar signal classification. A novel class detection scheme
utilizing a clustering approach on an unsupervised neural
network based Self-Organizing Map (SOM) is proposed in [4].
In [5], canonical correlation analysis (CCA) is employed as a
multiaspect feature extraction method for underwater target
classification. In [6], a K-nearest neighbor (K-NN) system
is used as a memory to provide the closest matches of an
unknown pattern in the feature space. In the past few years,
support vector machines (SVMs) have seen an increased usage
in applications of underwater target classification. The
method in [7] proposes to adopt SVM as the classifier for
features captured using the Hilbert-Huang transform (HHT). In
[8], an underwater acoustic feature extraction and classification
method based on the Wigner-Ville distribution (WVD) and the
SVM is presented.
Compared with conventional machine-learning systems
based on a priori knowledge, deep networks are able to
hierarchically learn the high-level features from the large
number of samples, and the extracted deep features are more
robust to signal variations [9–11]. In [12], Kamal et al. proposed
to incorporate the Deep Belief Network (DBN) to capture
several layers of deep features from the underwater acoustic
signals, which are more abstract at the higher layers. Our
past work in [13] utilizes a Stacked Autoencoder (SAE)
for feature learning with the short-time Fourier transform
(STFT), which provides competitive performance. However,
these fully-connected networks demand huge collections of
training samples for effective training, especially when applied
to multiple-frame T-F features.
In recent years, Convolutional Neural Networks (CNNs)
have been successfully applied to many pattern recognition
tasks with local connectivity and weight sharing [14–16].
Compared with the fully-connected deep models, these popular
CNN architectures use a set of filters which process the local
parts of the whole input to capture the detailed characteristics.
Usually, the max-pooling is used to generate holistic and
invariant representations from the CNN features. However, the
max-pooling just focuses on the first-order statistics in the
local regions of the CNN features. For underwater acoustic
signals which have strong temporal relations, max-pooling
may ignore the high-level correlations in the time domain. The
second-order pooling (SOP) has shown success in computer
vision tasks to capture the second-order correlations of the
local features [17]. In this paper, we propose to learn the
second-order temporal correlations of the CNN features for
underwater target classification. The proposed SOP strategy
is designed to compute the co-occurrences of different CNN
filters using the temporal feature trajectory of CNN features as
input. Compared to the max-pooling, the proposed SOP strat-
egy is capable of exploring the second-order co-occurrences
for the CNN feature maps of underwater acoustic signals to
improve the classification performance.
The constant-Q transform (CQT) is popular in music sig-
nal processing since the bin frequencies of the CQT scale
have a perceptually relevant geometrical distribution [18–20].
Compared to the STFT, the CQT can provide a better fre-
quency resolution for lower frequencies and a better temporal
resolution for higher frequencies [21]. The radiated signal
of an underwater target contains much useful information
in the low frequency subbands, such as the line spectrum
components, which are related to the propeller’s turning. The
greater resolution in the low frequencies of the CQT can
contribute to a more robust feature representation. In this study,
unlike [12, 13], we use the CQT as the T-F representation
method for underwater target classification.
In this paper, a new underwater target classification frame-
work based on the CNN model is proposed. Our work focuses
on the second-order pooling (SOP) strategy for the CNN
feature maps. The proposed method is named the CNN-SOP
model. For each frequency subband, the pooling operation
takes a temporal sequence of every CNN feature map as
input to compute the similarities between these temporal
features of different CNN filters. The correlation features of
different frequency subbands are then passed through a signed
square-root step and l2 normalization to generate the final
feature vector, which is input into the softmax classifier for
classification. Furthermore, we propose to use the CQT to
generate the T-F representation for the CNN-SOP model. Since
the generalization ability to unseen conditions is significant in
practical applications, the proposed classification method is
tested on the real radiated acoustic signals recorded at new
depths. The results show that the proposed method achieves
an 8% improvement compared to other deep learning-based
approaches. The proposed second-order pooling strategy is
shown to improve the classification accuracy by a further 4%
over the max pooling.
The rest of this paper is organized as follows: Section II
introduces the related work of the CNN architecture and the
pooling strategy. Section III details the proposed CNN-SOP
model. The experimental results of this method are provided
in Section IV. In Section V we draw our conclusions of this
work.
II. RELATED WORK
Recently, CNN architectures have increasingly been used in
acoustic signal recognition. Approaches developed for image
recognition [22] can be extended to signal classification by
regarding the T-F representation (e.g. spectrogram and MFCC)
of raw signals as an image. In [23], the CNN model is
introduced in acoustic event detection to capture the local
properties of acoustic events, which provides competitive
performance in the evaluation task. In [24], the CNN networks
in conjunction with different data augmentation methods are
applied to environmental sound classification. In [25], the
performances of different auditory and spectrogram image
features using CNN models are evaluated. In [26], the CNN
architecture is integrated with the SVM classifier to improve
the overall classification performance of the real-time signals.
In [27], a CWT and CNN-based fault detection method is
proposed to extract the comprehensive T-F features of fault
signals.
A difficulty in extending regular CNN-based methods
to acoustic signals is that translation invariance in
frequency may not be appropriate, since a difference in
frequency bands usually indicates a different class. This problem
also exists in underwater acoustic signals, since the spectrum
distributions of various vessels differ significantly. One way to
overcome this difficulty is a deep convolutional neural
network architecture in which heterogeneous pooling is used to
provide constrained frequency-shift invariance in the speech
spectrogram [28]. In [29], a parallel CNN architecture is
created, which comprises a CNN layer which is optimized for
processing and recognizing relations in the frequency domain,
and a parallel one which is aimed at capturing temporal
relations. Another promising approach to this
problem is to add an intermap pooling (IMP) layer to the CNN
to increase robustness to spectral variations [30].
Second-order pooling methods have been widely used in
many computer vision tasks. Our proposed pooling approach
is inspired by the second-order pooling scheme in [17] which
summarizes sets of local features inside a free-form region,
while preserving information about their pairwise correlations.
However, this approach uses second-order pooling directly on
raw local descriptors such as SIFT while we apply the SOP
to the CNN feature maps in this work. In [31], a bilinear
CNN (B-CNN) model is proposed for image classification
which consists of two feature extractors based on CNNs whose
outputs are multiplied using the outer product at each location
to obtain the bilinear vector. When using the same CNN
extractor, the bilinear pooling used in the B-CNN model can
be seen as a second-order pooling approach. An improved
bilinear pooling method for CNN features is proposed in [32]
which proposes to use the matrix square-root normalization to
improve the classification performance. In [33], two compact
bilinear representations are proposed to reduce the dimensions
of the full bilinear models. Since the T-F representation is
different from the image input, in contrast to these B-CNN
models, our SOP method just focuses on the temporal corre-
lations and preserves the correlation matrix for each frequency
subband, which can retain the spectral variation characteristics
of different classes. Another second-order temporal pooling
is proposed for action recognition in [34], which uses the
temporal classification scores to generate the descriptor rather
than the CNN features.
III. THE PROPOSED SYSTEM
The whole framework of the proposed CNN-SOP model
is described in Fig. 1. In the preprocessing stage, the raw
radiated acoustic signals are converted into a time-frequency
representation using the CQT. Multiple frames of the CQT
representation are combined to generate the input for the CNN
network. Instead of using max pooling, we adopt second-order
pooling for the CNN feature maps to obtain the temporal
correlation features of the input. Elementwise square-root and
l2 normalization are used to further improve the performance.
The whole network can be trained end-to-end with back-
propagation.
A. Preprocessing using CQT
Since underwater radiated signals are non-stationary, T-F
representation approaches have been shown to be more
effective for feature extraction. The CQT
can transform the time-domain signal to the T-F domain
such that the center frequencies of the frequency bins are
geometrically spaced and their Q-factors are all equal [35].
That means the CQT can provide a better frequency resolution
for low frequency subbands compared to the STFT, and can
show more details about the low-frequency components. In this
paper, we propose to use the CQT to deal with the radiated
acoustic signals.
Given a discrete time-domain signal x(n), the CQT is
defined as:

$$X^{CQ}(k, n) = \sum_{j=n-\lfloor N_k/2 \rfloor}^{n+\lfloor N_k/2 \rfloor} x(j)\, a_k^*(j - n + N_k/2) \qquad (1)$$

where $k = 1, 2, \ldots, K$ indexes the K frequency bins of the
CQT, and $a_k^*(n)$ is the complex conjugate of the basis function
[35]. $N_k$ denotes the window length, which varies across bins.

The center frequency of the kth bin is defined by:

$$f_k = f_1 2^{\frac{k-1}{B}} \qquad (2)$$

where $f_1$ is the center frequency of the lowest-frequency bin
and B is the number of bins per octave, which determines
the time-frequency resolution trade-off of the CQT. The
total number of frequency bins K of the CQT can then be
computed as:

$$K = B\left(\log_2 \frac{f_{\max}}{f_1} + 1\right) \qquad (3)$$

where $f_{\max}$ is the center frequency of the highest-frequency
bin.
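As a quick numerical illustration of Eqs. (2)–(3), the following Python sketch evaluates the bin center frequencies and the bin count; the parameter values ($f_1$ = 4 Hz, $f_{\max}$ = 1 kHz, B = 8) are the ones adopted later in Section IV-A, and the helper name is ours:

```python
import math

f1, fmax, B = 4.0, 1000.0, 8   # parameter values adopted later in Section IV-A

# Eq. (2): center frequency of the k-th bin, f_k = f1 * 2^((k-1)/B)
def center_freq(k):
    return f1 * 2.0 ** ((k - 1) / B)

print(center_freq(1))        # 4.0 Hz: the lowest-frequency bin
print(center_freq(B + 1))    # 8.0 Hz: one octave above f1 when B = 8

# Eq. (3): total number of frequency bins (rounded to an integer in practice)
K = B * (math.log2(fmax / f1) + 1)
print(K)
```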
In our work, we propose to use the CQT to obtain the T-F
features for the CNN model. The CQT T-F feature is derived
from multiple frames as follows:

$$X = \{X^1, X^2, \ldots, X^N\} \qquad (4)$$
Fig. 2. The CNN architecture: the input X is processed by L convolutional
layers (with filter parameters W_1, W_2, ..., W_L and mapping functions
f_1, f_2, ...), producing the feature maps H^1, H^2, ..., H^L.
where

$$X^i = 20 \log_{10} \|X^{CQ}(i)\| \qquad (5)$$

and $X^i$ is the CQT feature for frame i, and N denotes the total
number of frames. $X^{CQ}(i) \in \mathbb{C}^K$ is the complex-valued CQT
vector of the K frequency bins representing frame i.
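To make the framing of Eqs. (4)–(5) concrete, here is a minimal NumPy sketch; it assumes the complex CQT matrix has already been computed by some CQT implementation (the paper uses the MATLAB toolbox of [38]), and the function names are ours:

```python
import numpy as np

def log_magnitude_frames(C_cq):
    """Eq. (5): turn complex CQT frames X^CQ(i) (shape K x N) into
    log-magnitude features X^i = 20*log10||X^CQ(i)||."""
    return 20.0 * np.log10(np.abs(C_cq) + 1e-10)   # eps guards against log(0)

def to_cnn_inputs(X, frames_per_sample=64):
    """Eq. (4): group N consecutive frames into fixed-size CNN input samples."""
    K, N = X.shape
    return [X[:, i:i + frames_per_sample]
            for i in range(0, N - frames_per_sample + 1, frames_per_sample)]

# C_cq would come from a CQT implementation such as the MATLAB toolbox of [38];
# here a random placeholder with K = 64 bins and 640 frames stands in for it:
C_cq = np.random.randn(64, 640) + 1j * np.random.randn(64, 640)
samples = to_cnn_inputs(log_magnitude_frames(C_cq))   # ten 64 x 64 input samples
```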
B. CNN architecture
In contrast to the fully-connected layers, CNNs are designed
to restrict the connections between the hidden units and the
input units, so that each hidden unit connects to only
a small neighborhood of input units. The
locally connected structure also makes it possible for CNNs
to model the local correlations of the input. By replicating
weights across the whole input, the parameters of the convo-
lutional layers are reduced. In this paper, we propose to use the
CNN model comprised of L convolutional layers to learn the
deep representation of the CQT feature. The CNN architecture
is described in Fig. 2. Unlike the regular CNN models, the
max-pooling layers are not adopted in the network since the
resolution is important for classification.
Given our input X, the CNN model learns the nonlinear
representation f which maps the input X to the output $H^L$
of the Lth layer:

$$H^L = f(X) = f_L(\cdots f_2(f_1(X; W_1); W_2) \cdots; W_L) \qquad (6)$$

where $f_l$ is the mapping function of the lth convolutional layer,
which takes the input $H^l$ and generates the feature maps $H^{l+1}$
with the filter parameters $W_l$. The convolutional layers are
constructed with rectified linear units (ReLUs); details of the
convolution process can be found in [36]. The feature maps of
the last layer $H^L$ form an h × w × c array, where h and
w denote the height and width of each feature map, and c is
the number of feature maps.
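A minimal TensorFlow/Keras sketch of this convolutional front end, using the two-layer configuration that Section IV reports as best (8 × 8 filters, 2 × 2 strides, ReLU, no intermediate max-pooling); the framework choice and function name are ours:

```python
import tensorflow as tf

# Two convolutional layers with 8x8 filters, 2x2 strides, ReLU activations and
# no intermediate max-pooling, matching the configuration of Fig. 4:
def build_cnn_front_end(input_shape=(64, 64, 1)):
    inputs = tf.keras.Input(shape=input_shape)   # the CQT input sample X
    h = tf.keras.layers.Conv2D(8, kernel_size=8, strides=2, padding="same",
                               activation="relu")(inputs)   # H^1: 32 x 32 x 8
    h = tf.keras.layers.Conv2D(16, kernel_size=8, strides=2, padding="same",
                               activation="relu")(h)        # H^2: 16 x 16 x 16
    return tf.keras.Model(inputs, h)

build_cnn_front_end().summary()   # H^L has shape (16, 16, 16) = h x w x c
```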
C. Second-order pooling
The T-F representation of the radiated acoustic signals has
strong temporal correlations, which can help to discriminate
different targets. In this work, we propose to use a second-
order pooling scheme for the CNN features to capture the
temporal correlations of the CQT input.
Since the CNN feature maps $H^L$ are learned from the CQT
feature X, h and w correspond to the frequency bins and the
temporal frames of the CQT input. For each frequency bin of
the feature maps, we denote $s^m = [s^m_1, s^m_2, \ldots, s^m_w] \in \mathbb{R}^w$ as
the temporal feature trajectory of the mth feature map (see
Fig. 3).
Fig. 1. The framework of the proposed CNN-SOP system: the original signal
is preprocessed with the CQT into a K × N input X, passed through the CNN
architecture to produce the h × w × c feature maps H^L, pooled per frequency
bin with SOP(S) into a c × c × h result, normalized into the vector z, and fed
through a dense layer and softmax classifier to obtain the class scores.
Fig. 3. The second-order pooling operation: the c × w temporal feature
matrix S is mapped to the c × c matrix SOP(S) = SS^T.
The second-order pooling operator is defined as:

$$\mathrm{SOP}(s^j, s^k) = \sum_{i=1}^{w} s^j_i s^k_i = (s^j)^T s^k \qquad (7)$$

where $\mathrm{SOP}(s^j, s^k)$ represents the temporal correlation of
two feature trajectories $s^j$ and $s^k$ from the jth and kth feature
maps. The SOP operator is designed to capture the interactions
of two convolution filters along the time axis. For c feature
maps, we denote $S \in \mathbb{R}^{c \times w}$ as the temporal feature matrix;
the SOP operator can then be defined in matrix form as:

$$\mathrm{SOP}(S) = SS^T \qquad (8)$$

where $\mathrm{SOP}(S) \in \mathbb{R}^{c \times c}$ is a symmetric positive semidefinite
matrix which captures the temporal correlations of all the
CNN filters for one frequency bin.
Since the differences between frequency bins are useful for
distinguishing underwater acoustic signals, unlike the pooling
strategy in [31], which uses sum-pooling to aggregate the
correlations across the whole image, we retain the SOP results
of all frequency bins to preserve the frequency distinctions for
classification. The final SOP feature, shown in Fig. 1, consists
of h SOP matrices, one per frequency bin of the CNN feature
maps.
It is often found that normalization offers significant
improvements in deep networks. In this work, we apply
elementwise signed square-root and $\ell_2$ normalization to the
SOP matrices. The SOP matrices are first flattened into the
vector $p \in \mathbb{R}^l$, where $l = c \times c \times h$. Then, the vector p
is passed through the elementwise signed square-root
($q \leftarrow \mathrm{sign}(p)\sqrt{|p|}$) and $\ell_2$ normalization ($z \leftarrow q/\|q\|_2$).
For CNN feature maps of size h × w × c, the computational
complexity of our proposed SOP strategy is $O(hwc^2)$,
which is the same as the bilinear pooling in [31], while the
max pooling is $O(hwc)$.
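The following NumPy sketch implements Eqs. (7)–(8) together with the signed square-root and $\ell_2$ normalization described above; the function names are illustrative:

```python
import numpy as np

def second_order_pool(H):
    """Eqs. (7)-(8): H is an h x w x c array of CNN feature maps. For each of
    the h frequency bins, form the temporal feature matrix S (c x w) whose
    m-th row is the trajectory s^m, and compute SOP(S) = S S^T (c x c)."""
    h, w, c = H.shape
    out = np.empty((h, c, c))
    for i in range(h):
        S = H[i].T            # c x w temporal feature matrix for bin i
        out[i] = S @ S.T      # correlations of all pairs of CNN filters
    return out                # all h SOP matrices are retained

def sqrt_l2_normalize(P):
    """Signed square-root and l2 normalization of the flattened SOP features."""
    p = P.reshape(-1)                         # vector of length l = c * c * h
    q = np.sign(p) * np.sqrt(np.abs(p))       # elementwise signed square-root
    return q / (np.linalg.norm(q) + 1e-12)    # z = q / ||q||_2

H = np.abs(np.random.randn(16, 16, 16))       # e.g. the 16 x 16 x 16 maps of Fig. 4
z = sqrt_l2_normalize(second_order_pool(H))   # 4096-dim vector for the dense layer
```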
D. Softmax classification
The resulting second-order pooling vector z is passed through
a dense layer and then input to the softmax layer for
classification. The class score of the ith sample $z^{(i)}$ for
category j is computed as follows:

$$p(y^{(i)} = j \mid a^{(i)}; \theta) = \frac{e^{\theta_j^T a^{(i)}}}{\sum_{t=1}^{u} e^{\theta_t^T a^{(i)}}} \qquad (9)$$

where

$$a^{(i)} = f_{L+1}(z^{(i)}; W_{L+1}) \qquad (10)$$

and $a^{(i)} \in \mathbb{R}^{l_1}$ is the output activation at the dense layer for
the ith sample, and $l_1$ and $W_{L+1}$ denote the number of nodes
and the parameters of the dense layer. We again use the ReLU
for the mapping function $f_{L+1}$. $\theta_j \in \mathbb{R}^{l_1}$ denotes the
parameters of the softmax layer for the jth unit, and u is
the total number of classes.
In this paper, we use the cross-entropy loss function as
the objective function [37]. Since the second-order pooling
and the normalization steps are both differentiable, the back-
propagation can be used to calculate the gradient [31]. Then
we fine-tune the whole model using the Adam optimization
algorithm. The whole model can be trained end-to-end.
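As a sketch of how the whole pipeline can be trained end-to-end, the snippet below expresses the SOP and normalization as differentiable TensorFlow ops inside a Keras model and compiles it with the Adam optimizer and cross-entropy loss; the layer sizes follow Fig. 4 and Section IV-A, while everything else (names, Lambda-layer packaging) is our assumption:

```python
import tensorflow as tf

def sop_and_normalize(feature_maps):           # feature_maps: (batch, h, w, c)
    # Eq. (8) per frequency bin: summing over time w gives S S^T for each bin
    corr = tf.einsum("bhwc,bhwd->bhcd", feature_maps, feature_maps)
    p = tf.reshape(corr, [tf.shape(corr)[0], -1])        # flatten to h*c*c
    q = tf.sign(p) * tf.sqrt(tf.abs(p) + 1e-12)          # signed square-root
    return tf.math.l2_normalize(q, axis=-1)              # l2 normalization

inputs = tf.keras.Input(shape=(64, 64, 1))
h = tf.keras.layers.Conv2D(8, 8, strides=2, padding="same", activation="relu")(inputs)
h = tf.keras.layers.Conv2D(16, 8, strides=2, padding="same", activation="relu")(h)
z = tf.keras.layers.Lambda(sop_and_normalize)(h)         # 16*16*16 = 4096-dim z
a = tf.keras.layers.Dense(1024, activation="relu")(z)    # dense layer of Eq. (10)
scores = tf.keras.layers.Dense(5, activation="softmax")(a)   # softmax of Eq. (9)

model = tf.keras.Model(inputs, scores)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_x, train_y, epochs=1000, batch_size=50)   # Section IV-A settings
```

Since every operation in the graph is differentiable, back-propagation handles the SOP and normalization steps automatically, which is what makes the end-to-end training described above possible.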
IV. EXPERIMENTS AND RESULTS
This section provides experiments to evaluate the perfor-
mance of our proposed CNN-SOP model for underwater target
classification. The experiments were performed on the real
radiated acoustic signals of 5 marine vessels. The advantage
of the proposed SOP scheme was verified by comparing
with the max pooling and the bilinear pooling [31]. We also
compared the classification accuracy of the proposed method
with previous deep learning methods, such as the DBN model
[12] and the SAE model [13].
TABLE I
DETAILS OF THE DATASET: NUMBER OF SAMPLES PER VESSEL (A–E) IN THE
TRAINING AND TESTING SETS

Depth (m)    A      B      C      D      E      Dataset
50           2880   2640   2880   1200   3600   Training
150          5520   6480   1200   4320   800    Training
70           2880   5680   920    2880   640    Testing
100          4800   4560   3600   3360   4800   Testing
200          2640   1640   560    1680   560    Testing
A. Experimental setup
In the experiments, the radiated acoustic signals were
recorded with a single hydrophone in the South China Sea
in 2015. The hydrophone was placed below the sea surface
at 5 depths (50 m, 70 m, 100 m, 150 m and 200 m). The radiated
signals were collected from 5 different vessels, which had
various weight, size, propeller structure and engine system.
The sampling rate of the signals was 50 kHz. For each run, the
portion of the recording when the vessel ranged from +500 m
to −500 m was selected.
In the preprocessing stage, the raw radiated signals were
transformed into CQT features. The signals were resampled
at a sampling rate of 4 kHz. We used the MATLAB toolbox
of [38] to compute the CQT representation. For the radiated
signals, we focused only on frequencies below 1 kHz.
The center frequency of the lowest-frequency bin f1 and
the highest frequency bin fmax were set to be 4 Hz and 1
kHz, respectively. The bin number for each octave B was 8.
Thus, the CQT can capture 64 bands covering 8 octaves. Each
single CQT feature frame can be computed using 23 points (5
milliseconds). We combined 64 frames for each CQT feature
to generate the input sample of the CNN model, then each
sample had the size of 64×64, which was derived from 1472
points (0.32 seconds). Since the radiated signals recorded from
different depths have various signal-to-noise ratios (SNRs),
to evaluate the generalization ability to unseen conditions,
we trained the proposed CNN-SOP model with the samples
generated from the depths of 50m and 150m, while testing the
model with the samples at depths of 70m, 100m and 200m.
The training set contained 31520 input samples and the testing
set had 41200 samples. The details of the whole dataset are
presented in Table I.
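A small sketch of the depth-based split of Table I, training on the 50 m and 150 m recordings and testing on the unseen depths; the data-structure layout is hypothetical:

```python
# Depth-based split of Table I: train on 50 m and 150 m recordings, then test
# on the unseen depths 70 m, 100 m and 200 m to probe generalization.
TRAIN_DEPTHS, TEST_DEPTHS = {50, 150}, {70, 100, 200}

def split_by_depth(samples):
    """samples: iterable of (depth_m, cqt_sample, label) tuples."""
    train = [(x, y) for d, x, y in samples if d in TRAIN_DEPTHS]
    test = [(x, y) for d, x, y in samples if d in TEST_DEPTHS]
    return train, test
```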
The proposed CNN model contained several convolutional
layers, each with a filter size of 8 bands × 8 frames and a
stride of 2 bands × 2 frames. The whole model was optimized
using the Adam optimizer with a learning rate of 0.0001. The
network was trained for 1000 epochs with a minibatch size
of 50. Our implementation was developed in TensorFlow
using an NVIDIA Tesla K40 GPU.
B. Comparison with the CNN model using max pooling
We first compared the performance of the proposed second-
order pooling based CNN model (CNN-SOP) with the CNN
model using max-pooling (CNN-MP). For the CNN-MP
model, the CNN feature maps of the last convolutional layer
were pooled with the pooling size of 2 bands × 2 frames
and the sub-sampling factor of 2 × 2.

Fig. 4. Model structures of the CNN-SOP model and the CNN-MP model
(two-convolutional-layer configurations):
CNN-SOP (2L): CQT input (64 × 64) → Conv. 32 × 32 × 8 → Conv. 16 × 16 × 16
→ SOP 16 × 16 × 16 → Norm. 4096 → Dense 1024 → Softmax 5.
CNN-MP (2L): CQT input (64 × 64) → Conv. 32 × 32 × 8 → Conv. 16 × 16 × 16
→ MP 8 × 8 × 16 → Dense 1024 → Softmax 5.

We have tested the
two CNN models with different convolutional layers from
1 to 4. The model structure for 2 convolutional layers is
described in Fig. 4. To reduce the computational complexity
of the SOP, the number of the CNN filters c was set to a
small number, in this case 16. The resolution is important for
the CQT feature of underwater acoustic signals, especially the
frequency resolution since the CQT features for some targets
are very similar in the frequency domain. To obtain more
discriminative features, we just added one max-pooling layer
after the final convolutional layer for the two CNN models.
Both networks were evaluated using the dataset in Table I.
Fig. 5 shows the classification accuracies of the CNN-SOP
model and the CNN-MP model with different convolutional
layers. It can be seen that the proposed CNN-SOP network
achieves better performance compared to the regular CNN-MP
model, with an improvement of 4% in overall classification
accuracy. We also found that deeper CNNs did not always
produce better results in our experiments, and both CNN
models yielded the highest accuracies when the number of
convolutional layers was set to 2. This may be explained by
the limited number of training samples and the wide CNN
filter (8 × 8) we used, so that fewer convolutional layers
are more efficient than a larger number of layers. It can be
observed that the classification accuracies of these two models
decline as the sea depth increases. This may be due to the
SNRs of the radiated signals degrading at greater depths, as
the distance between the surface vessel and the hydrophone
increases.
To explore the individual target performance of these two
models, we also use confusion matrices to show the
classification results. Both networks have 2 convolutional
layers, which proved to be the best configuration. We can see
from Fig. 6 that the CNN-SOP model provides better
classification accuracies than the CNN-MP model for all targets.
C. Comparison with STFT feature
The STFT feature has been used as the input for a DBN
model to provide the spectrum information of the radiated
signals [12]. To evaluate the advantages of the CQT feature,
the STFT feature was used for comparison. Similar to [12],
Fig. 5. Classification accuracies of the CNN-SOP model and the CNN-MP
model with different numbers of convolutional layers (1–4) for the dataset at
depths of 70m (upper-left), 100m (upper-right), 200m (lower-left), and the
overall results (lower-right).
the STFT feature was calculated with 1024 FFT points and a
sampling rate of 4 kHz. We concatenated 8 frames to generate
the input for the CNN model, so each input STFT feature
had a size of 512 dimensions × 8 frames. In this section,
we again applied the CNN-SOP model and the CNN-MP model
to the STFT feature for comparison. When using the STFT
feature, the filter size and the stride size of the convolutional
layers were both set to 8 bands × 1 frame. We again used
2 convolutional layers for the CNN models with the STFT
feature, with feature-map sizes of 64 × 8 × 8 and 8 ×
8 × 16 for the two convolutional layers.
The classification results of the CQT feature and the STFT
feature using two CNN models are compared in Fig. 7. It can
be seen that the CQT feature offers a 3% improvement over
the STFT feature using the CNN-SOP model, and a 1.6%
improvement using the CNN-MP model. This demonstrates
that the CQT feature is more appropriate for the CNN model
compared with the STFT feature when applied to radiated
acoustic signals, which may be explained by the better res-
olution at the lower frequencies.
D. Comparison with other pooling methods
To verify the effectiveness of the proposed SOP strategy,
we compared the proposed SOP with the bilinear pooling in
[31]. In [31], the B-CNN model is proposed which applies
bilinear pooling to the VGG-16 network [39]. When using the
same CNN extractor, the bilinear pooling can be seen as a
second-order pooling approach. In this section, three pooling
Fig. 6. Confusion matrices for the overall classification accuracy of the
CNN-SOP model (upper) and the CNN-MP model (lower). Rows indicate
the true label and columns indicate the predicted label.

CNN-SOP:
            pred A   pred B   pred C   pred D   pred E
true A      0.9603   0.0312   0.0000   0.0000   0.0085
true B      0.0279   0.9721   0.0000   0.0000   0.0000
true C      0.0000   0.0274   0.9315   0.0411   0.0000
true D      0.0000   0.0000   0.0216   0.9784   0.0000
true E      0.0322   0.0000   0.0092   0.0000   0.9587

CNN-MP:
            pred A   pred B   pred C   pred D   pred E
true A      0.9062   0.0589   0.0000   0.0000   0.0349
true B      0.0380   0.9374   0.0000   0.0000   0.0246
true C      0.0219   0.0128   0.9041   0.0612   0.0000
true D      0.0000   0.0000   0.0606   0.9394   0.0000
true E      0.0675   0.0000   0.0138   0.0000   0.9187
Fig. 7. Classification results of the CQT feature and the STFT feature using
the two CNN models (STFT+CNN-MP, STFT+CNN-SOP, CQT+CNN-MP,
CQT+CNN-SOP) at 70m, 100m, 200m and overall.
approaches based on the VGG network were used for com-
parison with the same CQT feature, which were the proposed
SOP, the bilinear pooling [31] and the max-pooling. However,
since the standard VGG-16 network has 13 convolutional and
3 fully-connected layers, leading to too many parameters to train, the standard
VGG may not be suitable for our limited dataset. Thus we
considered using a modified VGG-16 network consisting of
the first 7 convolutional layers, three pooling layers and one
dense layer in the experiment. Unlike the CNN model used
in IV.B which adopted the convolutional filter of size 8 × 8,
the VGG network used smaller (3 × 3) filters. We also used
fewer filters in each convolutional layer of the modified VGG
network, with (16-32-64) filters for the three convolutional
groups. The single dense layer had 4096 units like the standard
VGG network.
The max-pooling of the modified VGG network (VGG-MP)
was similar to the standard VGG, which was performed over a
2 × 2 window with stride 2. The B-CNN model based on the
modified VGG network had 64 filters in the final convolutional
layer, thus the bilinear feature dimension was 64×64 = 4096.
We also applied the proposed SOP strategy on the same VGG
network above (VGG-SOP) for comparison. The CNN feature
of the final convolutional layer had the size of 8 × 8 ×
64, which meant that the SOP feature had the dimension of
8 × 64 × 64 = 32768. The elementwise square-root and l2
normalization were used before the final classification for the
SOP and the bilinear pooling. The learning rate of the Adam
optimizer was set to 0.001. The network was trained for 600
epochs with a minibatch size of 64.
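A sketch of the modified VGG used here (the first 7 convolutional layers in (2, 2, 3) groups with (16-32-64) filters, 3 × 3 kernels, three 2 × 2 max-pooling layers, one 4096-unit dense layer); the exact grouping of the 7 layers is our assumption based on the standard VGG-16 layout, and this is the VGG-MP variant:

```python
import tensorflow as tf
from tensorflow.keras import layers

# First 7 VGG convolutional layers in (2, 2, 3) groups with (16-32-64) filters,
# 3x3 kernels, and three 2x2 max-pooling layers (the VGG-MP variant). The
# VGG-SOP and B-CNN variants instead apply SOP or bilinear pooling to the
# final 8 x 8 x 64 feature maps in place of flattening.
def modified_vgg(input_shape=(64, 64, 1), n_classes=5):
    inputs = tf.keras.Input(shape=input_shape)
    h = inputs
    for filters, n_convs in [(16, 2), (32, 2), (64, 3)]:
        for _ in range(n_convs):
            h = layers.Conv2D(filters, 3, padding="same", activation="relu")(h)
        h = layers.MaxPooling2D(2)(h)        # 2x2 window, stride 2
    h = layers.Flatten()(h)                  # final maps: 8 x 8 x 64
    h = layers.Dense(4096, activation="relu")(h)   # single dense layer
    outputs = layers.Dense(n_classes, activation="softmax")(h)
    return tf.keras.Model(inputs, outputs)
```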
It can be seen from Fig. 8 that when using the same
VGG network, the VGG-SOP outperforms the B-CNN model
[31] by nearly 2% and the max-pooling by 3%. The results
show that, compared to the bilinear pooling, the proposed SOP
strategy can take advantage of the local discrimination along
the frequency axis, which is more suitable for classification of
underwater acoustic signals.
Fig. 8. Classification results of the proposed SOP (VGG-SOP), the bilinear
pooling (B-CNN [31]) and the max pooling (VGG-MP) based on the VGG
network at 70m, 100m, 200m and overall.
E. Comparison with previous DNN-based classification models
In this section, we compared the classification accuracy
against other deep learning-based underwater target classifi-
cation systems [12, 13] with our dataset. We have applied
the CQT to the DBN model [12] and the SAE model [13] for
comparison. Since the DBN and SAE are both fully-connected
deep networks, the input CQT sample has the dimension of
4096 (64 bands × 64 frames), which may lead to too many
parameters and a heavy computational load. Thus we extracted
the averages across consecutive 8 frames from the original 64
TABLE II
COMPARISON OF THE PROPOSED CNN-SOP MODEL WITH THE DBN
MODEL AND THE SAE MODEL USING THE CQT FEATURE IN TERMS OF
CLASSIFICATION ACCURACY
Method 70 m 100 m 200 m Overall
DBN [12] 0.8941 0.8707 0.8305 0.8712
SAE [13] 0.9052 0.8819 0.8553 0.8847
Proposed CNN-SOP 0.9714 0.9656 0.9421 0.9634
frames to generate the CQT features for the DBN and SAE,
which had the dimension of 512 (64 bands × 8 frames). The
model structures of the DBN and SAE were similar to [12]
and [13]. The DBN model had 3 hidden layers (200-100-50)
while the SAE model was composed of 3 autoencoders with
100 units. We can see from Table II that the proposed CNN-
SOP model improves the overall classification accuracy by 8%
compared to the DBN and SAE model when using the same
CQT input. This shows that our CNN-SOP model has a great
advantage over these fully-connected networks.
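The frame averaging used to build the DBN/SAE inputs reduces each 64 × 64 CQT sample to 64 bands × 8 frames; a one-function NumPy sketch (names ours):

```python
import numpy as np

def average_frames(x, group=8):
    """Reduce a 64 x 64 CQT sample to 64 bands x 8 frames by averaging
    each run of `group` consecutive frames."""
    K, N = x.shape
    return x.reshape(K, N // group, group).mean(axis=2)

x = np.random.randn(64, 64)                 # one CQT input sample
v = average_frames(x).reshape(-1)           # 512-dim vector for the DBN/SAE
```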
V. CONCLUSION
In this paper, we have introduced a novel CNN model using
second-order pooling to capture the temporal correlations
for underwater target classification. The radiated signals are
transformed into a T-F feature using the CQT as the inputs to
the CNN model. The proposed second-order pooling learns the
temporal similarities of different CNN filters by computing
the covariance matrix of the CNN feature maps along the time
axis. The experimental results on the real radiated acoustic
signals recorded at various sea depths demonstrate that the
second-order pooling achieves better performance than the
max pooling. The CQT feature
has also been demonstrated to be more effective than the
STFT feature when applied to the proposed CNN model.
The proposed CNN-based classification approach improves the
classification accuracy by 8% compared with the state-of-the-
art deep learning methods.
ACKNOWLEDGMENT
The research was supported by the National Natural Science
Foundation of China (Grant no. 61601369). This work was com-
pleted when the first author was a visiting student in the School
of Electrical, Electronic and Computer Engineering, University
of Western Australia. We gratefully acknowledge the support
of NVIDIA Corporation with the donation of the Tesla K40
GPU used for this research.
REFERENCES
[1] M. R. Azimi-Sadjadi, D. Yao, Q. Huang, and G. J.
Dobeck, “Underwater target classification using wavelet
packets and neural networks,” IEEE Transactions on
Neural Networks, vol. 11, no. 3, pp. 784–794, 2000.
[2] S. Ji, X. Liao, and L. Carin, “Adaptive multiaspect target
classification and detection with hidden markov models,”
IEEE Sensors Journal, vol. 5, no. 5, pp. 1035–1042,
2005.
[3] J. De Seixas, N. De Moura et al., “Preprocessing passive
sonar signals for neural classification,” IET radar, sonar
& navigation, vol. 5, no. 6, pp. 605–612, 2011.
[4] S. Kamal, A. Mujeeb, M. Supriya et al., “Novel class
detection of underwater targets using self-organizing
neural networks,” in Underwater Technology (UT), 2015
IEEE. IEEE, 2015, pp. 1–5.
[5] A. Pezeshki, M. R. Azimi-Sadjadi, and L. L. Scharf,
“Undersea target classification using canonical correla-
tion analysis,” IEEE Journal of Oceanic Engineering,
vol. 32, no. 4, pp. 948–955, 2007.
[6] M. R. Azimi-Sadjadi, D. Yao, A. A. Jamshidi, and G. J.
Dobeck, “Underwater target classification in changing
environments using an adaptive feature mapping,” IEEE
Transactions on neural networks, vol. 13, no. 5, pp.
1099–1111, 2002.
[7] S. Wang and X. Zeng, “Robust underwater noise targets
classification using auditory inspired time–frequency
analysis,” Applied Acoustics, vol. 78, pp. 68–76, 2014.
[8] Y. Wu, X. Li, and Y. Wang, “Extraction and classification
of acoustic scattering from underwater target based on
wigner-ville distribution,” Applied Acoustics, vol. 138,
pp. 52–59, 2018.
[9] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mo-
hamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen,
T. N. Sainath et al., “Deep neural networks for acoustic
modeling in speech recognition: The shared views of
four research groups,” IEEE Signal Processing Magazine,
vol. 29, no. 6, pp. 82–97, 2012.
[10] S.-H. Fang, Y.-X. Fei, Z. Xu, and Y. Tsao, “Learning
transportation modes from smartphone sensors based on
deep neural network,” IEEE Sensors Journal, vol. 17,
no. 18, pp. 6111–6118, 2017.
[11] A. Dairi, F. Harrou, Y. Sun, and M. Senouci, “Ob-
stacle detection for intelligent transportation systems
using deep stacked autoencoder and k-nearest neighbor
scheme,” IEEE Sensors Journal, vol. 18, no. 12, pp.
5122–5132, 2018.
[12] S. Kamal, S. K. Mohammed, P. S. Pillai, and M. Supriya,
“Deep learning architectures for underwater target recog-
nition,” in Ocean Electronics (SYMPOL), 2013. IEEE,
2013, pp. 48–54.
[13] X. Cao, X. Zhang, Y. Yu, and L. Niu, “Deep learning-
based recognition of underwater target,” in Digital Signal
Processing (DSP), 2016 IEEE International Conference
on. IEEE, 2016, pp. 89–93.
[14] P. Swietojanski, A. Ghoshal, and S. Renals, “Convolu-
tional neural networks for distant speech recognition,”
IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1120–
1124, 2014.
[15] X. Xiang, N. Lv, M. Zhai, and A. El Saddik, “Real-
time parking occupancy detection for gas stations based
on haar-adaboosting and cnn,” IEEE Sensors Journal,
vol. 17, no. 19, pp. 6360–6367, 2017.
[16] Y. Wang, A. Yang, X. Chen, P. Wang, Y. Wang, and
H. Yang, “A deep learning approach for blind drift
calibration of sensor networks,” IEEE Sensors Journal,
vol. 17, no. 13, pp. 4158–4171, 2017.
[17] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu,
“Semantic segmentation with second-order pooling,” in
European Conference on Computer Vision. Springer,
2012, pp. 430–443.
[18] W. J. Pielemeier and G. H. Wakefield, “A high-resolution
time–frequency representation for musical instrument
signals,” The Journal of the Acoustical Society of Amer-
ica, vol. 99, no. 4, pp. 2382–2396, 1996.
[19] W. J. Pielemeier, G. H. Wakefield, and M. H. Simoni,
“Time-frequency analysis of musical signals,” Proceed-
ings of the IEEE, vol. 84, no. 9, pp. 1216–1230, 1996.
[20] G. Costantini, R. Perfetti, and M. Todisco, “Event based
transcription system for polyphonic piano music,” Signal
Processing, vol. 89, no. 9, pp. 1798–1811, 2009.
[21] J. C. Brown, “Calculation of a constant q spectral trans-
form,” The Journal of the Acoustical Society of America,
vol. 89, no. 1, pp. 425–434, 1991.
[22] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods
for generic object recognition with invariance to pose and
lighting,” in Computer Vision and Pattern Recognition,
2004. CVPR 2004. Proceedings of the 2004 IEEE Com-
puter Society Conference on, vol. 2. IEEE, 2004, pp.
II–104.
[23] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani,
“Exploiting spectro-temporal locality in deep learning
based acoustic event detection,” EURASIP Journal on
Audio, Speech, and Music Processing, vol. 2015, no. 1,
p. 26, 2015.
[24] J. Salamon and J. P. Bello, “Deep convolutional neural
networks and data augmentation for environmental sound
classification,” IEEE Signal Processing Letters, vol. 24,
no. 3, pp. 279–283, 2017.
[25] R. Hyder, S. Ghaffarzadegan, Z. Feng, J. H. Hansen,
and T. Hasan, “Acoustic scene classification using a cnn-
supervector system trained with auditory and spectro-
gram image features,” Proc. Interspeech 2017, pp. 3073–
3077, 2017.
[26] S. Lekha and M. Suchetha, “A novel 1-d convolution neu-
ral network with svm architecture for real-time detection
applications,” IEEE Sensors Journal, vol. 18, no. 2, pp.
724–731, 2018.
[27] M.-F. Guo, X.-D. Zeng, D.-Y. Chen, and N.-C. Yang,
“Deep-learning-based earth fault detection using contin-
uous wavelet transform and convolutional neural network
in resonant grounding distribution systems,” IEEE Sen-
sors Journal, vol. 18, no. 3, pp. 1291–1300, 2018.
[28] L. Deng, O. Abdel-Hamid, and D. Yu, “A deep convo-
lutional neural network using heterogeneous pooling for
trading acoustic invariance with phonetic confusion,” in
Acoustics, Speech and Signal Processing (ICASSP), 2013
IEEE International Conference on. IEEE, 2013, pp.
6669–6673.
[29] T. Lidy and A. Schindler, “Cqt-based convolutional
neural networks for audio scene classification,” in Pro-
ceedings of the Detection and Classification of Acous-
tic Scenes and Events 2016 Workshop (DCASE2016),
vol. 90. DCASE2016 Challenge, 2016, pp. 1032–1048.
[30] H. Lee, G. Kim, H.-G. Kim, S.-H. Oh, and S.-Y. Lee,
“Deep cnns along the time axis with intermap pooling for
robustness to spectral variations,” IEEE signal processing
letters, vol. 23, no. 10, pp. 1310–1314, 2016.
[31] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn
models for fine-grained visual recognition,” in Proceed-
ings of the IEEE International Conference on Computer
Vision, 2015, pp. 1449–1457.
[32] T.-Y. Lin and S. Maji, “Improved bilinear pooling with
cnns,” arXiv preprint arXiv:1707.06772, 2017.
[33] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, “Compact
bilinear pooling,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2016, pp.
317–326.
[34] A. Cherian and S. Gould, “Second-order temporal
pooling for action recognition,” arXiv preprint arX-
iv:1704.06925, 2017.
[35] C. Schörkhuber and A. Klapuri, “Constant-q transform
toolbox for music processing,” in 7th Sound and Music
Computing Conference, Barcelona, Spain, 2010, pp. 3–
64.
[36] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn,
“Applying convolutional neural networks concepts to
hybrid nn-hmm model for speech recognition,” in Acous-
tics, Speech and Signal Processing (ICASSP), 2012 IEEE
International Conference on. IEEE, 2012, pp. 4277–
4280.
[37] S. W. Abeyruwan, D. Sarkar, F. Sikder, and U. Visser,
“Semi-automatic extraction of training examples from
sensor readings for fall detection and posture monitor-
ing,” IEEE Sensors Journal, vol. 16, no. 13, pp. 5406–
5415, 2016.
[38] C. Schörkhuber, A. Klapuri, N. Holighaus, and
M. Dörfler, “A matlab toolbox for efficient perfect recon-
struction time-frequency transforms with log-frequency
resolution,” in Audio Engineering Society Conference:
53rd International Conference: Semantic Audio. Audio
Engineering Society, 2014.
[39] K. Simonyan and A. Zisserman, “Very deep convolu-
tional networks for large-scale image recognition,” arXiv
preprint arXiv:1409.1556, 2014.
Lucknow 💋 Call Girls in Lucknow ₹7.5k Pick Up & Drop With Cash Payment 892311...anilsa9823
 
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Servicesonnydelhi1992
 
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Servicesonnydelhi1992
 
Jeremy Casson - An Architectural and Historical Journey Around Europe
Jeremy Casson - An Architectural and Historical Journey Around EuropeJeremy Casson - An Architectural and Historical Journey Around Europe
Jeremy Casson - An Architectural and Historical Journey Around EuropeJeremy Casson
 
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...akbard9823
 
Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...
Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...
Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...Jeremy Casson
 
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...anilsa9823
 
Lucknow 💋 best call girls in Lucknow (Adult Only) 8923113531 Escort Service ...
Lucknow 💋 best call girls in Lucknow  (Adult Only) 8923113531 Escort Service ...Lucknow 💋 best call girls in Lucknow  (Adult Only) 8923113531 Escort Service ...
Lucknow 💋 best call girls in Lucknow (Adult Only) 8923113531 Escort Service ...anilsa9823
 
Bobbie goods coloring book 81 pag_240127_163802.pdf
Bobbie goods coloring book 81 pag_240127_163802.pdfBobbie goods coloring book 81 pag_240127_163802.pdf
Bobbie goods coloring book 81 pag_240127_163802.pdfMARIBEL442158
 
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...akbard9823
 
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...anilsa9823
 
Alex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson StoryboardAlex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson Storyboardthephillipta
 
exhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptxexhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptxKurikulumPenilaian
 
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...akbard9823
 
Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...
Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...
Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...anilsa9823
 
Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...
Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...
Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...anilsa9823
 
VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112
VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112
VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112Nitya salvi
 

Recently uploaded (20)

Storyboard short: Ferrarius Tries to Sing
Storyboard short: Ferrarius Tries to SingStoryboard short: Ferrarius Tries to Sing
Storyboard short: Ferrarius Tries to Sing
 
Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...
Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...
Lucknow 💋 Cheap Call Girls In Lucknow Finest Escorts Service 8923113531 Avail...
 
Lucknow 💋 Call Girls in Lucknow ₹7.5k Pick Up & Drop With Cash Payment 892311...
Lucknow 💋 Call Girls in Lucknow ₹7.5k Pick Up & Drop With Cash Payment 892311...Lucknow 💋 Call Girls in Lucknow ₹7.5k Pick Up & Drop With Cash Payment 892311...
Lucknow 💋 Call Girls in Lucknow ₹7.5k Pick Up & Drop With Cash Payment 892311...
 
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
 
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
 
Jeremy Casson - An Architectural and Historical Journey Around Europe
Jeremy Casson - An Architectural and Historical Journey Around EuropeJeremy Casson - An Architectural and Historical Journey Around Europe
Jeremy Casson - An Architectural and Historical Journey Around Europe
 
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
 
Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...
Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...
Jeremy Casson - How Painstaking Restoration Has Revealed the Beauty of an Imp...
 
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...
Lucknow 💋 Escorts Service Lucknow Phone No 8923113531 Elite Escort Service Av...
 
Lucknow 💋 best call girls in Lucknow (Adult Only) 8923113531 Escort Service ...
Lucknow 💋 best call girls in Lucknow  (Adult Only) 8923113531 Escort Service ...Lucknow 💋 best call girls in Lucknow  (Adult Only) 8923113531 Escort Service ...
Lucknow 💋 best call girls in Lucknow (Adult Only) 8923113531 Escort Service ...
 
Bobbie goods coloring book 81 pag_240127_163802.pdf
Bobbie goods coloring book 81 pag_240127_163802.pdfBobbie goods coloring book 81 pag_240127_163802.pdf
Bobbie goods coloring book 81 pag_240127_163802.pdf
 
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...
 
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
 
(NEHA) Call Girls Mumbai Call Now 8250077686 Mumbai Escorts 24x7
(NEHA) Call Girls Mumbai Call Now 8250077686 Mumbai Escorts 24x7(NEHA) Call Girls Mumbai Call Now 8250077686 Mumbai Escorts 24x7
(NEHA) Call Girls Mumbai Call Now 8250077686 Mumbai Escorts 24x7
 
Alex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson StoryboardAlex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson Storyboard
 
exhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptxexhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptx
 
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...
 
Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...
Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...
Lucknow 💋 Russian Call Girls Lucknow - Book 8923113531 Call Girls Available 2...
 
Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...
Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...
Lucknow 💋 Call Girl in Lucknow | Whatsapp No 8923113531 VIP Escorts Service A...
 
VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112
VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112
VIP Ramnagar Call Girls, Ramnagar escorts Girls 📞 8617697112
 

2933bf63f71e22ee0d6e84792f3fec1a.pdf

  • 1. 1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2886368, IEEE Sensors Journal JOURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 Convolutional Neural Network with Second-order Pooling for Underwater Target Classification Xu Cao, Student Member, IEEE, Roberto Togneri, Senior Member, IEEE, Xiaomin Zhang, and Yang Yu, Member, IEEE, Abstract—Underwater target classification using passive sonar remains a critical issue due to the changeable ocean environment. Convolutional Neural Networks (CNNs) have shown success in learning invariant features using local filtering and max pooling. In this paper, we propose a novel classification framework which combines the CNN architecture with the second-order pooling (SOP) to capture the temporal correlations from the time- frequency (T-F) representation of the radiated acoustic signals. The convolutional layers are used to learn the local features with a set of kernel filters from the T-F inputs which are extracted by the constant-Q transform (CQT). Instead of using max pooling, the proposed SOP operator is designed to learn the co-occurrences of different CNN filters using the temporal feature trajectory of CNN features for each frequency subband. To preserve the frequency distinctions, the correlated features of each frequency subband are retained. The pooling results are normalized with signed square-root and l2 normalization, and then input into the softmax classifier. The whole network can be trained in an end-to-end fashion. To explore the generalization ability to unseen conditions, the proposed CNN model is evaluated on the real radiated acoustic signals recorded at new sea depths. The experimental results demonstrate that the proposed method yields an 8% improvement in classification accuracy over the state-of-the-art deep learning methods. Index Terms—underwater target classification, convolutional neural networks, second-order pooling, constant-Q transform. I. INTRODUCTION UNDERWATER target classification is aimed to detect and recognize the marine vessels with the radiated acoustic signals recorded by the passive sonar. It has many important applications in ocean engineering, such as automatic target recognition (ATR) and marine monitoring. The task can be formulated as a feature representation problem where the discriminative characteristics are learned from the received acoustic signals for classification. However, when applied in practical situations, robustness and generalization ability to environment variation are significant for passive sonar target classification, especially from single-sensor recordings. Several factors affect the performances of the classification systems, including the lack of a priori knowledge of the targets, the various working conditions of the same class such as the speed and the power configuration and the unpredictable ocean X. Cao, X. Zhang and Y. Yu are with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China(e-mail: caoxu@mail.nwpu.edu.cn; xmzhang@nwpu.edu.cn; n- wpuyuy@nwpu.edu.cn). R. 
ing, The University of Western Australia, Perth, WA 6009, Australia (e-mail: roberto.togneri@uwa.edu.au).
Manuscript received August 15, 2018; revised November 5, 2018.

background noise. Consequently, more adaptive and robust classification models are needed to deal with this problem.

Several pattern recognition methods have been developed for underwater target classification with different features extracted from the radiated acoustic signals. In [1], the features generated from the wavelet packet transform (WPT) and the linear predictive coding (LPC) are put into a neural network (NN) classifier. In [2], a Hidden Markov Model (HMM) is used for multiaspect target detection and identification. In [3], a preprocessing method is developed to improve the performance of a feedforward neural network (NN) for passive sonar signal classification. A novel class detection scheme utilizing a clustering approach on an unsupervised neural network based Self-Organizing Map (SOM) is proposed in [4]. In [5], canonical correlation analysis (CCA) is employed as a multiaspect feature extraction method for underwater target classification. In [6], a K-nearest neighbor (K-NN) system is used as a memory to provide the closest matches of an unknown pattern in the feature space. In the past few years, support vector machines (SVMs) have seen increased usage in underwater target classification. The method in [7] adopts the SVM as the classifier for features captured using the Hilbert-Huang transform (HHT). In [8], an underwater acoustic feature extraction and classification method based on the Wigner-Ville distribution (WVD) and the SVM is presented.

Compared with conventional machine-learning systems based on a priori knowledge, deep networks are able to hierarchically learn high-level features from large numbers of samples, and the extracted deep features are more robust to variations [9-11]. In [12], Kamal et al. proposed to incorporate the Deep Belief Network (DBN) to capture several layers of deep features from the underwater acoustic signals, which are more abstract at the higher layers. Our past work in [13] utilizes a Stacked Autoencoder (SAE) for feature learning with the short-time Fourier transform (STFT), which provides competitive performance. However, these fully-connected networks demand huge collections of training samples for effective training, especially when applied to multiple-frame T-F features.

In recent years, Convolutional Neural Networks (CNNs) have been successfully applied to many pattern recognition tasks with local connectivity and weight sharing [14-16]. Compared with the fully-connected deep models, these popular CNN architectures use a set of filters which process local parts of the whole input to capture the detailed characteristics.
Usually, max-pooling is used to generate holistic and invariant representations from the CNN features. However, max-pooling focuses only on the first-order statistics in local regions of the CNN features. For underwater acoustic signals, which have strong temporal relations, max-pooling may ignore the high-level correlations in the time domain. Second-order pooling (SOP) has shown success in computer vision tasks in capturing the second-order correlations of local features [17]. In this paper, we propose to learn the second-order temporal correlations of the CNN features for underwater target classification. The proposed SOP strategy is designed to compute the co-occurrences of different CNN filters using the temporal feature trajectory of the CNN features as input. Compared to max-pooling, the proposed SOP strategy is capable of exploring the second-order co-occurrences of the CNN feature maps of underwater acoustic signals to improve the classification performance.

The constant-Q transform (CQT) is popular in music signal processing since the bin frequencies of the CQT scale have a perceptually relevant geometrical distribution [18-20]. Compared to the STFT, the CQT provides a better frequency resolution for lower frequencies and a better temporal resolution for higher frequencies [21]. The radiated signal of an underwater target contains much useful information in the low frequency subbands, such as the line spectrum components, which are related to the propeller's turning. The greater resolution in the low frequencies of the CQT can contribute to a more robust feature representation. In this study, unlike [12, 13], we use the CQT as the T-F representation method for underwater target classification.

In this paper, a new underwater target classification framework based on the CNN model is proposed. Our work focuses on the second-order pooling (SOP) strategy for the CNN feature maps. The proposed method is named the CNN-SOP model. For each frequency subband, the pooling operation takes a temporal sequence of every CNN feature map as input to compute the similarities between the temporal features of different CNN filters. The correlation features of the different frequency subbands are then passed through a signed square-root step and l2 normalization to generate the final feature vector, which is input into the softmax classifier for classification. Furthermore, we propose to use the CQT to generate the T-F representation for the CNN-SOP model. Since the generalization ability to unseen conditions is significant in practical applications, the proposed classification method is tested on real radiated acoustic signals recorded at new depths. The results show that the proposed method achieves an 8% improvement compared to other deep learning-based approaches. The proposed second-order pooling strategy is shown to improve the classification accuracy by a further 4% over max pooling.
The rest of this paper is organized as follows: Section II introduces related work on the CNN architecture and the pooling strategy. Section III details the proposed CNN-SOP model. The experimental results of this method are provided in Section IV. In Section V we draw the conclusions of this work.

II. RELATED WORK

Recently, CNN architectures have increasingly been used in acoustic signal recognition. Approaches developed for image recognition [22] can be extended to signal classification by regarding the T-F representation (e.g. spectrogram and MFCC) of raw signals as an image. In [23], the CNN model is introduced in acoustic event detection to capture the local properties of acoustic events, which provides competitive performance in the evaluation task. In [24], CNN networks in conjunction with different data augmentation methods are applied to environmental sound classification. In [25], the performances of different auditory and spectrogram image features using CNN models are evaluated. In [26], the CNN architecture is integrated with the SVM classifier to improve the overall classification performance on real-time signals. In [27], a CWT and CNN-based fault detection method is proposed to extract the comprehensive T-F features of fault signals.

A difficulty when extending regular CNN-based methods to acoustic signals is that translation invariance in frequency may not be appropriate, since a difference in frequency bands usually means a different class. This problem also exists in underwater acoustic signals, since the spectrum distributions of various vessels differ significantly. One may overcome this difficulty with a deep convolutional neural network architecture in which heterogeneous pooling provides constrained frequency-shift invariance in the speech spectrogram [28]. In [29], a parallel CNN architecture is created, which comprises a CNN layer optimized for processing and recognizing relations in the frequency domain, and a parallel one aimed at capturing temporal relations. Another promising CNN network to deal with this problem adds an intermap pooling (IMP) layer to increase robustness to spectral variations [30].

Second-order pooling methods have been widely used in many computer vision tasks. Our proposed pooling approach is inspired by the second-order pooling scheme in [17], which summarizes sets of local features inside a free-form region while preserving information about their pairwise correlations. However, that approach applies second-order pooling directly to raw local descriptors such as SIFT, while we apply the SOP to the CNN feature maps in this work. In [31], a bilinear CNN (B-CNN) model is proposed for image classification which consists of two CNN-based feature extractors whose outputs are multiplied using the outer product at each location to obtain the bilinear vector. When using the same CNN extractor, the bilinear pooling used in the B-CNN model can be seen as a second-order pooling approach. An improved bilinear pooling method for CNN features is proposed in [32], which uses matrix square-root normalization to improve the classification performance. In [33], two compact bilinear representations are proposed to reduce the dimensions of the full bilinear models.
Since the T-F representation is different from an image input, in contrast to these B-CNN models our SOP method focuses only on the temporal correlations and preserves the correlation matrix for each frequency subband, which can retain the spectral variation characteristics of different classes. Another second-order temporal pooling is proposed for action recognition in [34], which uses the temporal classification scores rather than the CNN features to generate the descriptor.

III. THE PROPOSED SYSTEM

The whole framework of the proposed CNN-SOP model is described in Fig. 1. In the preprocessing stage, the raw radiated acoustic signals are converted into a time-frequency representation using the CQT. Multiple frames of the CQT representation are combined to generate the input for the CNN network. Instead of using max pooling, we adopt second-order pooling for the CNN feature maps to obtain the temporal correlation features of the input. Elementwise square-root and l2 normalization are used to further improve the performance. The whole network can be trained end-to-end with back-propagation.

Fig. 1. The framework of the proposed CNN-SOP system.

A. Preprocessing using CQT

For underwater radiated signals, which are non-stationary, T-F representation approaches have been shown to be more effective for feature extraction. The CQT transforms the time-domain signal to the T-F domain such that the center frequencies of the frequency bins are geometrically spaced and their Q-factors are all equal [35]. This means the CQT can provide a better frequency resolution for low frequency subbands compared to the STFT, and can show more details of the low-frequency components. In this paper, we propose to use the CQT to deal with the radiated acoustic signals.

Given a discrete time-domain signal x(n), the CQT is defined as:

X^{CQ}(k, n) = \sum_{j = n - \lfloor N_k/2 \rfloor}^{n + \lfloor N_k/2 \rfloor} x(j) \, a_k^*(j - n + N_k/2)    (1)

where k = 1, 2, ..., K indexes the K frequency bins of the CQT, and a_k^*(n) is the complex conjugate of the basis function [35]. N_k denotes the window length, which is variable. The center frequency of the kth bin is defined by:

f_k = f_1 \, 2^{(k-1)/B}    (2)

where f_1 is the center frequency of the lowest-frequency bin and B is the number of bins per octave, which determines the time-frequency resolution trade-off of the CQT. The total number of frequency bins K of the CQT can then be computed as:

K = B \left( \log_2 \frac{f_{\max}}{f_1} + 1 \right)    (3)

where f_{\max} is the center frequency of the highest-frequency bin.

In our work, we propose to use the CQT to obtain the T-F features for the CNN model. The CQT T-F feature is derived from multiple frames as follows:

X = \{ X^1, X^2, \ldots, X^N \}    (4)

where

X^i = 20 \log_{10} \lVert X^{CQ}(i) \rVert    (5)

and X^i is the CQT feature for frame i and N denotes the total number of frames. X^{CQ}(i) \in R^K is the complex-valued CQT vector of the K frequency bins representing frame i.
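To make the preprocessing concrete, the following is a minimal NumPy sketch of Eqs. (1)-(5) under the settings reported later in Section IV-A (f_1 = 4 Hz, B = 8, K = 64 bins, 23-point hop, 64 frames). It is a naive direct evaluation rather than the efficient toolbox of [38] used in the experiments, and the Hann window, 1/N_k scaling and zero-padding scheme are illustrative assumptions:

```python
import numpy as np

def naive_cqt_sample(x, fs=4000, f1=4.0, B=8, K=64, hop=23, n_frames=64):
    """64-band x 64-frame log-magnitude CQT input (Eqs. (1)-(5)).
    Expects a segment of roughly 1472 samples (0.37 s at 4 kHz)."""
    Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)            # constant Q-factor of all bins
    f = f1 * 2.0 ** (np.arange(K) / B)            # Eq. (2), 0-based bin index
    n_max = int(round(Q * fs / f[0]))             # longest (lowest-frequency) window
    xp = np.pad(np.asarray(x, float), (n_max, n_max))  # zero-pad so long windows fit
    X = np.zeros((K, n_frames))
    for k, fk in enumerate(f):
        Nk = int(round(Q * fs / fk))              # variable window length N_k
        n = np.arange(Nk)
        # Windowed complex exponential basis a_k(n); Eq. (1) correlates x with a_k*.
        ak = np.hanning(Nk) * np.exp(2j * np.pi * n * fk / fs) / Nk
        for i in range(n_frames):
            c = n_max + i * hop                   # frame centre in the padded signal
            seg = xp[c - Nk // 2 : c - Nk // 2 + Nk]
            X[k, i] = 20.0 * np.log10(np.abs(seg @ np.conj(ak)) + 1e-12)  # Eq. (5)
    return X                                      # one 64 x 64 input sample, Eq. (4)
```

With the paper's settings this returns the multi-frame input X of Eq. (4) directly in the 64 x 64 layout fed to the CNN.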
B. CNN architecture

Fig. 2. The CNN architecture.

In contrast to fully-connected layers, CNNs are designed to restrict the connections between the hidden units and the input units, so that each hidden unit connects to only a small neighborhood of the input units. The locally connected structure also makes it possible for CNNs to model the local correlations of the input. By replicating weights across the whole input, the parameters of the convolutional layers are reduced.

In this paper, we propose to use a CNN model comprised of L convolutional layers to learn the deep representation of the CQT feature. The CNN architecture is described in Fig. 2. Unlike regular CNN models, max-pooling layers are not adopted in the network, since the resolution is important for classification. Given our input X, the CNN model learns the nonlinear representation f which maps the input X to the Lth output H^L:

H^L = f(X) = f_L(\cdots f_2(f_1(X; W_1); W_2) \cdots, W_L)    (6)

where f_l is the mapping function of the lth convolutional layer, which takes the input H^l and generates the feature maps H^{l+1} with the filter parameters W_l. The convolutional layers are constructed with rectified linear units (ReLUs). The details of the convolutional process can be found in [36]. The feature maps of the last layer H^L form an h x w x c array, where h and w denote the height and width of each feature map, and c is the number of feature maps.
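As an illustration of the mapping in Eq. (6), here is a minimal tf.keras sketch of the convolutional trunk with the configuration used later in Section IV (8 x 8 kernels, 2 x 2 strides, ReLU, no max pooling); the "same" padding and default weight initialization are assumptions:

```python
import tensorflow as tf

def build_cnn_trunk(num_layers=2, filters=(8, 16)):
    """The nonlinear map f of Eq. (6): L ReLU convolutional layers, no max pooling."""
    inp = tf.keras.Input(shape=(64, 64, 1))       # CQT sample X: 64 bands x 64 frames
    h = inp
    for l in range(num_layers):
        h = tf.keras.layers.Conv2D(filters[l], kernel_size=8, strides=2,
                                   padding="same", activation="relu")(h)
    return tf.keras.Model(inp, h)                 # feature maps H^L of size h x w x c

# For L = 2 this yields 64x64x1 -> 32x32x8 -> 16x16x16, matching Fig. 4.
```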
C. Second-order pooling

The T-F representation of radiated acoustic signals has strong temporal correlations, which can help to discriminate different targets. In this work, we propose a second-order pooling scheme for the CNN features to capture the temporal correlations of the CQT input. Since the CNN feature maps H^L are learned from the CQT feature X, h and w correspond to the frequency bins and the temporal frames of the CQT input. For each frequency bin of the feature maps, we denote s^m = [s^m_1, s^m_2, \ldots, s^m_w] \in R^w as the temporal feature trajectory of the mth feature map (see Fig. 3).

Fig. 3. The second-order pooling operation.

The second-order pooling operator is defined as:

SOP(s^j, s^k) = \sum_{i=1}^{w} s^j_i s^k_i = (s^j)^T s^k    (7)

where SOP(s^j, s^k) represents the temporal correlation of the two feature trajectories s^j and s^k from the jth and kth feature maps. The SOP operator is designed to capture the interactions of two convolution filters along the time axis. For c feature maps, we denote S \in R^{c \times w} as the temporal feature matrix; the SOP operator can then be written in matrix form as:

SOP(S) = S S^T    (8)

where SOP(S) \in R^{c \times c} is a symmetric positive semidefinite matrix which captures the temporal correlations of all the CNN filters for one frequency bin. Since the differences between frequency bins are useful for distinguishing underwater acoustic signals, unlike the pooling strategy in [31], which uses sum-pooling to aggregate the correlations across the whole image, we retain the SOP results of all frequency bins to preserve the frequency distinctions for classification. The final SOP feature is shown in Fig. 1 and consists of h SOP operators, corresponding to the height of the feature maps.

It is often found that normalization offers significant improvements to a deep network. In this work, we incorporate the elementwise signed square-root and l2 normalization for the SOP operators. The resulting SOP operators are first vectorized into p \in R^l, where l = c x c x h. The vector p is then passed through the elementwise signed square-root (q <- sign(p) \sqrt{|p|}) and l2 normalization (z <- q / \lVert q \rVert_2).

For CNN feature maps of size h x w x c, the computational complexity of the proposed SOP strategy is O(hwc^2), the same as the bilinear pooling in [31], while max pooling is O(hwc).
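To make the pooling and normalization pipeline concrete, here is a minimal NumPy sketch of Eqs. (7)-(8) followed by the signed square-root and l2 steps; the small epsilon guarding the division is an assumption:

```python
import numpy as np

def second_order_pool(H):
    """Apply the SOP of Eq. (8) per frequency bin, then normalize.
    H: CNN feature maps of shape (h, w, c); returns z of length h * c * c."""
    h, w, c = H.shape
    pooled = np.empty((h, c, c))
    for b in range(h):                      # one correlation matrix per frequency bin
        S = H[b].T                          # temporal feature matrix S in R^{c x w}
        pooled[b] = S @ S.T                 # SOP(S) = S S^T, Eq. (8)
    p = pooled.ravel()                      # vectorize: l = c * c * h
    q = np.sign(p) * np.sqrt(np.abs(p))     # elementwise signed square-root
    return q / (np.linalg.norm(q) + 1e-12)  # l2 normalization

# For the 16 x 16 x 16 feature maps of Fig. 4 this yields the 4096-dim vector z.
```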
D. Softmax classification

The resulting vector z of the second-order pooling is then input to the softmax layer for classification after a dense layer. The class score of the ith sample z^{(i)} for category j is computed as follows:

p(y^{(i)} = j \mid a^{(i)}; \theta) = \frac{e^{\theta_j^T a^{(i)}}}{\sum_{t=1}^{u} e^{\theta_t^T a^{(i)}}}    (9)

where

a^{(i)} = f_{L+1}(z^{(i)}; W_{L+1})    (10)

and a^{(i)} \in R^{l_1} is the output activation of the dense layer for the ith sample, and l_1 and W_{L+1} denote the number of nodes and the model parameters of the dense layer. We again use the ReLU for the mapping function f_{L+1}. \theta_j \in R^{l_1} denotes the parameter of the softmax layer for the jth unit, and u is the total number of classes.

In this paper, we use the cross-entropy loss as the objective function [37]. Since the second-order pooling and the normalization steps are both differentiable, back-propagation can be used to calculate the gradients [31]. We then fine-tune the whole model using the Adam optimization algorithm. The whole model can be trained end-to-end.

IV. EXPERIMENTS AND RESULTS

This section provides experiments to evaluate the performance of the proposed CNN-SOP model for underwater target classification. The experiments were performed on real radiated acoustic signals from 5 marine vessels. The advantage of the proposed SOP scheme was verified by comparison with max pooling and the bilinear pooling of [31]. We also compared the classification accuracy of the proposed method with previous deep learning methods, namely the DBN model [12] and the SAE model [13].
TABLE I
DETAILS OF THE DATASET: NUMBER OF SAMPLES FOR EACH VESSEL IN THE TRAINING AND TESTING SETS

Depth (m)    A      B      C      D      E      Dataset
50           2880   2640   2880   1200   3600   Training
150          5520   6480   1200   4320   800    Training
70           2880   5680   920    2880   640    Testing
100          4800   4560   3600   3360   4800   Testing
200          2640   1640   560    1680   560    Testing

A. Experimental setup

In the experiments, the radiated acoustic signals were recorded with a single hydrophone in the South China Sea in 2015. The hydrophone was placed below the sea surface at 5 depths (50 m, 70 m, 100 m, 150 m and 200 m). The radiated signals were collected from 5 different vessels, which had various weights, sizes, propeller structures and engine systems. The sampling rate of the signals was 50 kHz. For each run, the portion of the recording in which the vessel ranged from +500 m to -500 m was selected.

In the preprocessing stage, the raw radiated signals were transformed into CQT features. The signals were resampled at a sampling rate of 4 kHz. We used the Matlab toolbox of [38] to compute the CQT representation. For the radiated signals, we focused on the frequencies below 1 kHz. The center frequencies of the lowest-frequency bin f_1 and the highest-frequency bin f_max were set to 4 Hz and 1 kHz, respectively. The number of bins per octave B was 8. Thus, the CQT captures 64 bands covering 8 octaves. Each single CQT feature frame was computed with a hop of 23 points (5.75 milliseconds). We combined 64 frames for each CQT feature to generate the input sample of the CNN model, so each sample had the size 64 x 64, derived from 1472 points (0.368 seconds).

Since the radiated signals recorded at different depths have different signal-to-noise ratios (SNRs), to evaluate the generalization ability to unseen conditions we trained the proposed CNN-SOP model with the samples generated from the depths of 50 m and 150 m, while testing the model with the samples from the depths of 70 m, 100 m and 200 m. The training set contained 31520 input samples and the testing set 41200 samples. The details of the whole dataset are presented in Table I.

The proposed CNN model contained several convolutional layers, each with the same filter size of 8 bands x 8 frames and a stride of 2 bands x 2 frames. The whole model was optimized using the Adam optimizer with a learning rate of 0.0001. The network was trained for 1000 epochs with a minibatch size of 50. Our implementation was developed upon TensorFlow using an NVIDIA Tesla K40 GPU.
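Putting the pieces together, the following sketch assembles the complete 2-layer CNN-SOP model of Fig. 4 with the training settings above (Adam, learning rate 0.0001, cross-entropy loss). This is a re-implementation under stated assumptions ("same" padding, epsilon constants), not the authors' released code:

```python
import tensorflow as tf

class SOPLayer(tf.keras.layers.Layer):
    """Differentiable second-order pooling (Eqs. (7)-(8)) with normalization."""
    def call(self, H):                                 # H: (batch, h, w, c)
        S = tf.transpose(H, [0, 1, 3, 2])              # per-bin S: (batch, h, c, w)
        P = tf.matmul(S, S, transpose_b=True)          # S S^T: (batch, h, c, c)
        l = P.shape[1] * P.shape[2] * P.shape[3]       # l = h * c * c (static)
        p = tf.reshape(P, [-1, l])
        q = tf.sign(p) * tf.sqrt(tf.abs(p) + 1e-12)    # signed square-root (eps assumed)
        return tf.math.l2_normalize(q, axis=1)         # l2 normalization

def build_cnn_sop(num_classes=5):
    inp = tf.keras.Input(shape=(64, 64, 1))
    h = tf.keras.layers.Conv2D(8, 8, strides=2, padding="same", activation="relu")(inp)
    h = tf.keras.layers.Conv2D(16, 8, strides=2, padding="same", activation="relu")(h)
    z = SOPLayer()(h)                                  # 16 * 16 * 16 = 4096-dim vector
    a = tf.keras.layers.Dense(1024, activation="relu")(z)   # dense layer of Eq. (10)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(a)  # Eq. (9)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# model.fit(x_train, y_train, batch_size=50, epochs=1000) matches Section IV-A.
```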
B. Comparison with the CNN model using max pooling

We first compared the performance of the proposed second-order pooling based CNN model (CNN-SOP) with the CNN model using max-pooling (CNN-MP). For the CNN-MP model, the CNN feature maps of the last convolutional layer were pooled with a pooling size of 2 bands x 2 frames and a sub-sampling factor of 2 x 2. We tested the two CNN models with between 1 and 4 convolutional layers. The model structure for 2 convolutional layers is described in Fig. 4. To reduce the computational complexity of the SOP, the number of CNN filters c was set to a small number, in this case 16. The resolution is important for the CQT feature of underwater acoustic signals, especially the frequency resolution, since the CQT features of some targets are very similar in the frequency domain. To obtain more discriminative features, we added only one max-pooling layer, after the final convolutional layer of the CNN-MP model. Both networks were evaluated using the dataset in Table I.

Fig. 4. Model structures of the CNN-SOP model and the CNN-MP model:
CNN-SOP (2L): CQT input (64x64) -> Conv. 32x32x8 -> Conv. 16x16x16 -> SOP 16x16x16 -> Norm. 4096 -> Dense 1024 -> Softmax 5
CNN-MP (2L): CQT input (64x64) -> Conv. 32x32x8 -> Conv. 16x16x16 -> MP 8x8x16 -> Dense 1024 -> Softmax 5

Fig. 5 shows the classification accuracies of the CNN-SOP model and the CNN-MP model with different numbers of convolutional layers. It can be seen that the proposed CNN-SOP network achieves better performance than the regular CNN-MP model, with an improvement of 4% in overall classification accuracy. We also found that deeper CNNs did not always produce better results in our experiments: both CNN models yield the highest accuracies when the number of convolutional layers is set to 2. This may be explained by the limited number of training samples and the wide CNN filters (8 x 8), so that fewer convolutional layers are more efficient than a larger number of layers. It can also be observed that the classification accuracies of the two models decline as the sea depth increases. This may be due to the SNRs of the radiated signals degrading as the distance between the surface vessel and the hydrophone increases.

To explore the per-target performance of the two models, we also use the confusion matrix to show the classification results. Both networks have 2 convolutional layers, which proved to be the best configuration. We can see from Fig. 6 that the CNN-SOP model provides better classification accuracies than the CNN-MP model for all targets.

C. Comparison with STFT feature

The STFT feature has been used as the input of a DBN model to provide the spectrum information of the radiated signals [12]. To evaluate the advantages of the CQT feature, the STFT feature was used for comparison. Similar to [12],
the STFT feature was calculated with 1024 FFT points and a sampling rate of 4 kHz. We concatenated 8 frames to generate the input for the CNN model, so the input STFT feature had the size of 512 dimensions x 8 frames. We again applied the CNN-SOP model and the CNN-MP model to the STFT feature for comparison. When using the STFT feature, the filter size and the stride of the convolutional layers were both set to 8 bands x 1 frame. We still used 2 convolutional layers for the CNN models with the STFT feature, giving feature sizes of 64 x 8 x 8 and 8 x 8 x 16 for the two convolutional layers.

Fig. 5. Classification accuracies of the CNN-SOP model and the CNN-MP model with different convolutional layers for the dataset at depths of 70 m (upper-left), 100 m (upper-right), 200 m (lower-left), and the overall results (lower-right).

The classification results of the CQT feature and the STFT feature using the two CNN models are compared in Fig. 7. It can be seen that the CQT feature offers a 3% improvement over the STFT feature using the CNN-SOP model, and a 1.6% improvement using the CNN-MP model. This demonstrates that the CQT feature is more appropriate for the CNN model than the STFT feature when applied to radiated acoustic signals, which may be explained by its better resolution at the lower frequencies.

Fig. 6. Confusion matrices for the overall classification accuracy of the CNN-SOP model (upper) and the CNN-MP model (lower). The X-axis indicates the predicted label and the Y-axis the true label:

CNN-SOP    A        B        C        D        E
class A    0.9603   0.0279   0.0000   0.0000   0.0322
class B    0.0312   0.9721   0.0274   0.0000   0.0000
class C    0.0000   0.0000   0.9315   0.0216   0.0092
class D    0.0000   0.0000   0.0411   0.9784   0.0000
class E    0.0085   0.0000   0.0000   0.0000   0.9587

CNN-MP     A        B        C        D        E
class A    0.9062   0.0380   0.0219   0.0000   0.0675
class B    0.0589   0.9374   0.0128   0.0000   0.0000
class C    0.0000   0.0000   0.9041   0.0606   0.0138
class D    0.0000   0.0000   0.0612   0.9394   0.0000
class E    0.0349   0.0246   0.0000   0.0000   0.9187

Fig. 7. Classification results of the CQT feature and the STFT feature using the two CNN models.
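For comparison with the CQT sketch earlier, here is a minimal NumPy version of this STFT baseline input (512 log-magnitude bins x 8 frames). The Hann window and the non-overlapping hop are assumptions, since the text specifies only the FFT size and frame count:

```python
import numpy as np

def stft_feature(x, n_fft=1024, hop=1024, n_frames=8):
    """Log-magnitude STFT input of size 512 x 8 for the baseline comparison.
    Expects at least n_frames * hop samples (about 2 s at 4 kHz)."""
    cols = []
    for i in range(n_frames):
        seg = np.asarray(x[i * hop : i * hop + n_fft], float)
        spec = np.fft.rfft(np.hanning(n_fft) * seg)          # 513 bins
        cols.append(20.0 * np.log10(np.abs(spec[:n_fft // 2]) + 1e-12))
    return np.stack(cols, axis=1)                            # shape (512, 8)
```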
D. Comparison with other pooling methods

To verify the effectiveness of the proposed SOP strategy, we compared the proposed SOP with the bilinear pooling of [31]. In [31], the B-CNN model is proposed, which applies bilinear pooling to the VGG-16 network [39]. When using the same CNN extractor, the bilinear pooling can be seen as a second-order pooling approach. In this section, three pooling approaches based on the VGG network were compared using the same CQT feature: the proposed SOP, the bilinear pooling [31] and max-pooling. However, the standard VGG-16 network is very deep (13 convolutional and 3 fully-connected layers), leading to too many parameters to train, so the standard VGG may not be suitable for our limited dataset. We therefore used a modified VGG-16 network consisting of
the first 7 convolutional layers, three pooling layers and one dense layer in the experiment. Unlike the CNN model used in Section IV-B, which adopted convolutional filters of size 8 x 8, the VGG network uses smaller (3 x 3) filters. We also used fewer filters in each convolutional layer of the modified VGG network, with (16-32-64) filters for the three convolutional groups. The single dense layer had 4096 units, as in the standard VGG network. The max-pooling of the modified VGG network (VGG-MP) was similar to that of the standard VGG, performed over a 2 x 2 window with stride 2. The B-CNN model based on the modified VGG network had 64 filters in the final convolutional layer, so the bilinear feature dimension was 64 x 64 = 4096. We also applied the proposed SOP strategy to the same VGG network (VGG-SOP) for comparison. The CNN feature of the final convolutional layer had the size 8 x 8 x 64, which means that the SOP feature had the dimension 8 x 64 x 64 = 32768. The elementwise square-root and l2 normalization were used before the final classification for both the SOP and the bilinear pooling. The learning rate of the Adam optimizer was set to 0.001. The networks were trained for 600 epochs with a minibatch size of 64.

It can be seen from Fig. 8 that, when using the same VGG network, the VGG-SOP outperforms the B-CNN model [31] by nearly 2% and max-pooling by 3%. The results show that, compared to the bilinear pooling, the proposed SOP strategy can take advantage of the local discrimination along the frequency axis, which makes it more suitable for the classification of underwater acoustic signals.

Fig. 8. Classification results of the proposed SOP, the bilinear pooling and the max pooling based on the VGG network.
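As a sketch of the modified VGG-SOP configuration just described (first 7 VGG convolutional layers with (16-32-64) filters, 3 x 3 kernels, three 2 x 2 max-pooling stages, one 4096-unit dense layer); the grouping of the 7 layers into (2, 2, 3) blocks follows the standard VGG-16 layout and is an assumption:

```python
import tensorflow as tf

def build_vgg_sop(num_classes=5):
    inp = tf.keras.Input(shape=(64, 64, 1))
    h = inp
    for filters, n_convs in [(16, 2), (32, 2), (64, 3)]:   # first 7 VGG conv layers
        for _ in range(n_convs):
            h = tf.keras.layers.Conv2D(filters, 3, padding="same",
                                       activation="relu")(h)
        h = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(h)
    # h: 8 x 8 x 64 feature maps; per-bin SOP gives 8 * 64 * 64 = 32768 dims.
    S = tf.keras.layers.Permute((1, 3, 2))(h)              # per-bin S: (8, 64, 8)
    P = tf.keras.layers.Lambda(
        lambda s: tf.matmul(s, s, transpose_b=True))(S)    # S S^T per frequency bin
    p = tf.keras.layers.Flatten()(P)
    q = tf.keras.layers.Lambda(
        lambda t: tf.sign(t) * tf.sqrt(tf.abs(t) + 1e-12))(p)  # signed square-root
    z = tf.keras.layers.Lambda(
        lambda t: tf.math.l2_normalize(t, axis=1))(q)      # l2 normalization
    a = tf.keras.layers.Dense(4096, activation="relu")(z)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(a)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Trained for 600 epochs with a minibatch size of 64 in the comparison of Fig. 8.
```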
E. Comparison with previous DNN-based classification models

In this section, we compared the classification accuracy against other deep learning-based underwater target classification systems [12, 13] on our dataset. We applied the CQT to the DBN model [12] and the SAE model [13] for comparison. Since the DBN and SAE are both fully-connected deep networks, the full CQT sample has a dimension of 4096 (64 bands x 64 frames), which would lead to too many parameters and a heavy computational load. We therefore averaged each run of 8 consecutive frames of the original 64 frames to generate the CQT features for the DBN and SAE, which had a dimension of 512 (64 bands x 8 frames). The model structures of the DBN and SAE were similar to [12] and [13]: the DBN model had 3 hidden layers (200-100-50), while the SAE model was composed of 3 autoencoders with 100 units each.

TABLE II
COMPARISON OF THE PROPOSED CNN-SOP MODEL WITH THE DBN MODEL AND THE SAE MODEL USING THE CQT FEATURE, IN TERMS OF CLASSIFICATION ACCURACY

Method              70 m     100 m    200 m    Overall
DBN [12]            0.8941   0.8707   0.8305   0.8712
SAE [13]            0.9052   0.8819   0.8553   0.8847
Proposed CNN-SOP    0.9714   0.9656   0.9421   0.9634

We can see from Table II that the proposed CNN-SOP model improves the overall classification accuracy by 8% compared to the DBN and SAE models when using the same CQT input. This shows that our CNN-SOP model has a clear advantage over these fully-connected networks.

V. CONCLUSION

In this paper, we have introduced a novel CNN model using second-order pooling to capture the temporal correlations for underwater target classification. The radiated signals are transformed into a T-F feature using the CQT as the input to the CNN model. The proposed second-order pooling learns the temporal similarities of different CNN filters by computing the covariance matrix of the CNN feature maps along the time axis. The experimental results on real radiated acoustic signals recorded at various sea depths demonstrate that the second-order pooling achieves better performance than max pooling. The CQT feature has also been demonstrated to be more effective than the STFT feature when applied to the proposed CNN model. The proposed CNN-based classification approach improves the classification accuracy by 8% compared with state-of-the-art deep learning methods.

ACKNOWLEDGMENT

The research was supported by the National Natural Science Foundation of China (Grant no. 61601369). This work was completed while the first author was a visiting student in the School of Electrical, Electronic and Computer Engineering, The University of Western Australia. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

REFERENCES

[1] M. R. Azimi-Sadjadi, D. Yao, Q. Huang, and G. J. Dobeck, "Underwater target classification using wavelet packets and neural networks," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 784-794, 2000.
[2] S. Ji, X. Liao, and L. Carin, "Adaptive multiaspect target classification and detection with hidden Markov models," IEEE Sensors Journal, vol. 5, no. 5, pp. 1035-1042, 2005.
[3] J. De Seixas, N. De Moura et al., "Preprocessing passive sonar signals for neural classification," IET Radar, Sonar & Navigation, vol. 5, no. 6, pp. 605-612, 2011.
[4] S. Kamal, A. Mujeeb, M. Supriya et al., "Novel class detection of underwater targets using self-organizing neural networks," in Underwater Technology (UT), 2015 IEEE. IEEE, 2015, pp. 1-5.
[5] A. Pezeshki, M. R. Azimi-Sadjadi, and L. L. Scharf, "Undersea target classification using canonical correlation analysis," IEEE Journal of Oceanic Engineering, vol. 32, no. 4, pp. 948-955, 2007.
[6] M. R. Azimi-Sadjadi, D. Yao, A. A. Jamshidi, and G. J. Dobeck, "Underwater target classification in changing environments using an adaptive feature mapping," IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1099-1111, 2002.
[7] S. Wang and X. Zeng, "Robust underwater noise targets classification using auditory inspired time-frequency analysis," Applied Acoustics, vol. 78, pp. 68-76, 2014.
[8] Y. Wu, X. Li, and Y. Wang, "Extraction and classification of acoustic scattering from underwater target based on Wigner-Ville distribution," Applied Acoustics, vol. 138, pp. 52-59, 2018.
[9] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[10] S.-H. Fang, Y.-X. Fei, Z. Xu, and Y. Tsao, "Learning transportation modes from smartphone sensors based on deep neural network," IEEE Sensors Journal, vol. 17, no. 18, pp. 6111-6118, 2017.
[11] A. Dairi, F. Harrou, Y. Sun, and M. Senouci, "Obstacle detection for intelligent transportation systems using deep stacked autoencoder and k-nearest neighbor scheme," IEEE Sensors Journal, vol. 18, no. 12, pp. 5122-5132, 2018.
[12] S. Kamal, S. K. Mohammed, P. S. Pillai, and M. Supriya, "Deep learning architectures for underwater target recognition," in Ocean Electronics (SYMPOL), 2013. IEEE, 2013, pp. 48-54.
[13] X. Cao, X. Zhang, Y. Yu, and L. Niu, "Deep learning-based recognition of underwater target," in Digital Signal Processing (DSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 89-93.
[14] P. Swietojanski, A. Ghoshal, and S. Renals, "Convolutional neural networks for distant speech recognition," IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1120-1124, 2014.
[15] X. Xiang, N. Lv, M. Zhai, and A. El Saddik, "Real-time parking occupancy detection for gas stations based on Haar-AdaBoosting and CNN," IEEE Sensors Journal, vol. 17, no. 19, pp. 6360-6367, 2017.
[16] Y. Wang, A. Yang, X. Chen, P. Wang, Y. Wang, and H. Yang, "A deep learning approach for blind drift calibration of sensor networks," IEEE Sensors Journal, vol. 17, no. 13, pp. 4158-4171, 2017.
[17] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in European Conference on Computer Vision. Springer, 2012, pp. 430-443.
[18] W. J. Pielemeier and G. H. Wakefield, "A high-resolution time-frequency representation for musical instrument signals," The Journal of the Acoustical Society of America, vol. 99, no. 4, pp. 2382-2396, 1996.
[19] W. J. Pielemeier, G. H. Wakefield, and M. H. Simoni, "Time-frequency analysis of musical signals," Proceedings of the IEEE, vol. 84, no. 9, pp. 1216-1230, 1996.
[20] G. Costantini, R. Perfetti, and M. Todisco, "Event based transcription system for polyphonic piano music," Signal Processing, vol. 89, no. 9, pp. 1798-1811, 2009.
[21] J. C. Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425-434, 1991.
[22] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2. IEEE, 2004, pp. II-104.
[23] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, "Exploiting spectro-temporal locality in deep learning based acoustic event detection," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 26, 2015.
[24] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279-283, 2017.
[25] R. Hyder, S. Ghaffarzadegan, Z. Feng, J. H. Hansen, and T. Hasan, "Acoustic scene classification using a CNN-supervector system trained with auditory and spectrogram image features," Proc. Interspeech 2017, pp. 3073-3077, 2017.
[26] S. Lekha and M. Suchetha, "A novel 1-D convolution neural network with SVM architecture for real-time detection applications," IEEE Sensors Journal, vol. 18, no. 2, pp. 724-731, 2018.
[27] M.-F. Guo, X.-D. Zeng, D.-Y. Chen, and N.-C. Yang, "Deep-learning-based earth fault detection using continuous wavelet transform and convolutional neural network in resonant grounding distribution systems," IEEE Sensors Journal, vol. 18, no. 3, pp. 1291-1300, 2018.
[28] L. Deng, O. Abdel-Hamid, and D. Yu, "A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6669-6673.
[29] T. Lidy and A. Schindler, "CQT-based convolutional neural networks for audio scene classification," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), vol. 90. DCASE2016 Challenge, 2016, pp. 1032-1048.
[30] H. Lee, G. Kim, H.-G. Kim, S.-H. Oh, and S.-Y. Lee, "Deep CNNs along the time axis with intermap pooling for robustness to spectral variations," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1310-1314, 2016.
[31] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear CNN models for fine-grained visual recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449-1457.
[32] T.-Y. Lin and S. Maji, "Improved bilinear pooling with CNNs," arXiv preprint arXiv:1707.06772, 2017.
[33] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, "Compact bilinear pooling," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 317-326.
[34] A. Cherian and S. Gould, "Second-order temporal pooling for action recognition," arXiv preprint arXiv:1704.06925, 2017.
[35] C. Schörkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," in 7th Sound and Music Computing Conference, Barcelona, Spain, 2010, pp. 3-64.
[36] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4277-4280.
[37] S. W. Abeyruwan, D. Sarkar, F. Sikder, and U. Visser, "Semi-automatic extraction of training examples from sensor readings for fall detection and posture monitoring," IEEE Sensors Journal, vol. 16, no. 13, pp. 5406-5415, 2016.
[38] C. Schörkhuber, A. Klapuri, N. Holighaus, and M. Dörfler, "A Matlab toolbox for efficient perfect reconstruction time-frequency transforms with log-frequency resolution," in Audio Engineering Society Conference: 53rd International Conference: Semantic Audio. Audio Engineering Society, 2014.
[39] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.