Bulletin of Electrical Engineering and Informatics
Vol. 9, No. 4, August 2020, pp. 1387~1393
ISSN: 2302-9285, DOI: 10.11591/eei.v9i4.2230
Journal homepage: http://beei.org
Performance analysis of the convolutional recurrent neural
network on acoustic event detection
Suk-Hwan Jung, Yong-Joo Chung
Department of Electronics Engineering, Keimyung University, South Korea
Article Info

Article history:
Received Nov 11, 2019
Revised Feb 3, 2020
Accepted Mar 23, 2020

ABSTRACT
In this study, we attempted to find the optimal hyper-parameters of
the convolutional recurrent neural network (CRNN) by investigating its
performance on acoustic event detection. Important hyper-parameters such
as the input segment length, learning rate, and criterion for the convergence test,
were determined experimentally. Additionally, the effects of batch normalization
and dropout on the performance were measured experimentally to obtain their
optimal combination. Further, we studied the effects of varying the batch
data on every iteration during the training. From the experimental results
using the TUT sound events synthetic 2016 database, we obtained optimal
performance with a learning rate of . We found that a longer input
segment length aided performance improvement, and batch normalization
was far more effective than dropout. Finally, performance improvement was
clearly observed by varying the starting points of the batch data for each iteration
during the training.
Keywords:
Acoustic event detection
Convolutional neural network
Convolutional recurrent neural network
Hyper-parameters
Recurrent neural network
This is an open access article under the CC BY-SA license.
Corresponding Author:
Yong-Joo Chung,
Department of Electronics Engineering,
Keimyung University,
1095 Dalgubul-Daero, Dalseo-Gu, Daegu, South Korea.
Email: yjjung@kuu.ac.kr
1. INTRODUCTION
Recently, there has been increasing research interest in acoustic event detection, where the existence
and occurrence times of the various sounds in our daily lives are identified. There are many applications for
acoustic event detection, including surveillance [1, 2], urban sound analysis [3, 4], information retrieval from
multimedia contents [5], health care monitoring [6, 7], and bird call detection [8, 9]. Deep neural networks
(DNNs) have demonstrated superior performance to conventional machine learning techniques in image
classification [10-12], speech recognition [13-15], and machine translation [16, 17]. In [18], the feedforward
neural network (FNN) was shown to outperform the Gaussian mixture model (GMM) and support
vector machine (SVM), which have traditionally been employed for acoustic event detection. FNN has
also been shown to outperform conventional GMM-HMM-based methods in polyphonic acoustic
event detection [19]. Therefore, current studies on acoustic event detection mainly focus on
DNN-based approaches.
However, due to the fixed connections between the input and hidden layers, FNN is ill-suited
to handling signal distortions in image classification. Similarly, FNN is insufficient for acoustic event
detection, as audio signal distortions are frequently encountered in the 2-dimensional time-frequency
domain of the signal. Another problem with FNN lies in modeling the correlation between the time frames
of the audio signal. As FNN merely concatenates several input frames to model the time correlation,
it often fails to capture the long-term time correlations of the audio signal.
Convolutional neural networks (CNNs) alleviate this limitation of FNN using 2-dimensional
filters whose parameters are shared across time and frequency shifts [20], and they have exhibited superior
performance to FNN in various pattern recognition tasks. In particular, due to its structural characteristics,
CNN can efficiently handle image distortions in 2-dimensional space [10]. Similarly, we expect that
the time-frequency domain distortions occurring in the audio signal can be accurately modeled by CNN.
Nevertheless, CNN is inefficient at modeling long-term time correlations between audio signal samples.
Recurrent neural networks (RNNs) have been used successfully in speech recognition [13],
and are superior to other neural networks in modeling the time-correlation information of time-series
signals such as speech and audio. However, because RNN cannot tackle 2-dimensional distortions in
the time-frequency domain of the audio signal, its performance is usually inferior to that of CNN when used
alone in acoustic event detection. Recently, several approaches have combined CNN and RNN for their
complementary merits. Among them, convolutional recurrent neural networks (CRNNs) have been used
successfully for acoustic event detection [21], speech recognition [22], and music classification [23].
As CRNN is constructed by connecting CNN, RNN, and FNN in series, it is more complex than other neural
networks; therefore, the combined effect of the networks is difficult to predict. Moreover, as the use of CRNN
for acoustic event detection is in its early stages, there are few studies on optimizing its various
hyper-parameters.
Therefore, in this study we attempted to find the optimal hyper-parameters of CRNN for acoustic event
detection. Several experiments were performed, and the hyper-parameters were selected based on
the results on the validation data. Important hyper-parameters, such as the input segment length,
learning rate, and criterion for the convergence test, were determined from the experiments. Additionally,
the effects of batch normalization and dropout on the performance were observed. We also studied the effects
of varying the batch data in every iteration during the training. This paper is organized as follows.
In Section 2, we introduce the feature extraction method for audio signals as well as the architecture of
the CRNN used for acoustic event detection. In Section 3, we present and discuss various experimental
results, and conclusions are given in Section 4.
2. FEATURE EXTRACTION AND CRNN ARCHITECTURE
2.1. Feature extraction
In this study, we used log-mel filterbank (LMFB) outputs as input features for the CRNN; the entire
feature extraction process is shown in Figure 1. We first computed the short-time Fourier transform (STFT)
over 40-ms frames of the audio signal, which is sampled at 44.1 kHz. STFTs were computed every 20 ms,
i.e., with 50% overlap [21]. Then, 40-dimensional mel filterbank outputs spanning 0 to 22050 Hz were
extracted from the STFTs and log-transformed to obtain the LMFBs, which were normalized by subtracting
the mean and dividing by the standard deviation computed over the entire training data.
Figure 1. Extraction process of log-mel filterbank (LMFB)
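As a concrete illustration, the following Python sketch implements this pipeline with librosa. The FFT size, the use of a power spectrogram, and the small constant added before the log are assumptions not specified in the paper.

```python
import numpy as np
import librosa

SR = 44100                 # sampling rate (Hz)
WIN = int(0.040 * SR)      # 40-ms analysis window -> 1764 samples
HOP = int(0.020 * SR)      # 20-ms hop (50% overlap) -> 882 samples
N_FFT = 2048               # assumption: next power of two above the window length
N_MELS = 40                # 40 mel bands spanning 0 to 22050 Hz

def extract_lmfb(y, sr=SR):
    """Log-mel filterbank (LMFB) features, shape (frames, 40)."""
    spec = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP, win_length=WIN)) ** 2
    mel = librosa.filters.mel(sr=sr, n_fft=N_FFT, n_mels=N_MELS, fmin=0.0, fmax=sr / 2)
    lmfb = np.log(mel @ spec + 1e-10)  # epsilon avoids log(0); value is an assumption
    return lmfb.T                      # (time, mel)

def normalize(features, mean, std):
    """Normalize with mean/std computed over the entire training set."""
    return (features - mean) / std
```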
2.2. The architecture of CRNN
Figure 2 presents the architecture of the CRNN used in this paper. The CRNN consists of CNN layers
followed by an RNN and an FNN in sequence. The CNN layers act as audio feature extractors that are robust
against distortions in the time-frequency domain. The RNN utilizes the time-correlation information of
the audio signal. Finally, the FNN serves as an output layer, producing the posterior probabilities for each
sound class at each time frame.
As the CNN takes 40-dimensional LMFBs as input features, the dimension of the CNN input
is T×40, where T is the input segment length. The CNN consists of 3 convolution layers, each of
which has 256 feature maps with 5×5 filters. The filter outputs are processed by batch normalization
and then passed through a ReLU activation function. To preserve the time-domain resolution, max pooling
is performed only along the frequency axis. Dropout is applied at a rate of 0.25 after each max pooling
layer [10]. The input segment length T is set to 1024 frames (20.48 s); we experimented with different values
of T to find the one that produced the best result on the validation dataset.
The output of the CNN is fed to the RNN, which consists of 256 gated recurrent units (GRUs).
The output layer consists of K units with sigmoid activation functions, where K is the number of
acoustic event classes. The sigmoid outputs are the posterior probabilities for each class at each time frame,
from which we decide whether an acoustic event is active by applying a threshold of 0.5.
Figure 2. Architecture of the CRNN used for acoustic event detection
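The sketch below builds this architecture in tf.keras with the stated dimensions (three convolution layers with 256 feature maps and 5×5 filters, batch normalization before ReLU, frequency-only max pooling, dropout 0.25, a 256-unit GRU, and K sigmoid outputs), compiled with the binary cross-entropy loss and Adam optimizer described in Section 3. The frequency pooling sizes (5, 4, 2), which reduce the 40 mel bands to 1 as in [21], and the single unidirectional GRU layer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

T, N_MELS, K = 1024, 40, 16   # segment length (frames), mel bands, event classes

def build_crnn():
    x = inputs = layers.Input(shape=(T, N_MELS, 1))
    # CNN: 3 conv layers, 256 feature maps, 5x5 filters, BN before ReLU,
    # max pooling along frequency only so the time resolution is preserved.
    for pool in (5, 4, 2):    # assumption: (5, 4, 2) reduces 40 mel bands to 1, as in [21]
        x = layers.Conv2D(256, (5, 5), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D(pool_size=(1, pool))(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Reshape((T, 256))(x)                # (T, 1, 256) -> (T, 256)
    x = layers.GRU(256, return_sequences=True)(x)  # RNN over the time frames
    # FNN output layer: per-frame posterior probability for each of K classes.
    outputs = layers.TimeDistributed(layers.Dense(K, activation='sigmoid'))(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='binary_crossentropy')
    return model
```

Per-frame event activity is then obtained by thresholding the sigmoid outputs at 0.5.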
3. EXPERIMENTAL RESULTS
3.1. Database and evaluation metric
To evaluate the performance of the CRNN on acoustic event detection, we used the TUT sound events
synthetic 2016 (TUT-SED Synthetic) database, which is widely used in this area [24]. TUT-SED Synthetic
contains artificially generated audio data, because it is difficult to obtain enough data using only audio
recorded in real environments; moreover, subjective labelling errors are mitigated by artificial data.
TUT-SED Synthetic was generated by mixing isolated sound events from 16 different classes. The total length
of the data is 566 minutes, which was divided into training, testing, and validation data in proportions of
60%, 20%, and 20%, respectively. Segments of length 3-15 seconds were used for training, testing,
and validation, and there were no common acoustic event instances between them. The sound classes and
their total durations in the database are shown in Table 1.
For the evaluation of acoustic event detection, we used both the error rate (ER) and the F-score as
metrics [25]. We adopted two types of evaluation methods: segment-based and event-based.
In the segment-based method, the binarized outputs of the CRNN are compared with the ground truth
in every segment of length 1 s. In the event-based method, the outputs of the CRNN are compared with
the labels in the ground truth whenever an acoustic event has been detected by the CRNN [25]. We sought
the optimal hyper-parameters of the CRNN by applying various conditions during the training. To find
the optimal learning rate, we trained the CRNN with several learning rates; the results are shown in Table 2.
A sketch of the segment-based scoring is given below.
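As a reference for the segment-based scoring, the following sketch computes the F-score and ER from binarized outputs and ground-truth activity, following the definitions of [25]; the (segments × classes) array layout is an assumption.

```python
import numpy as np

def segment_metrics(pred, ref):
    """Segment-based F-score and ER [25].
    pred, ref: binary arrays of shape (n_segments, n_classes)."""
    tp = np.sum(np.logical_and(pred == 1, ref == 1), axis=1)  # per-segment true positives
    fp = np.sum(np.logical_and(pred == 1, ref == 0), axis=1)  # false positives
    fn = np.sum(np.logical_and(pred == 0, ref == 1), axis=1)  # false negatives
    f_score = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Per-segment substitutions, deletions, and insertions.
    s = np.minimum(fn, fp)
    d = np.maximum(0, fn - fp)
    i = np.maximum(0, fp - fn)
    er = (s.sum() + d.sum() + i.sum()) / ref.sum()  # ref.sum() = total active events
    return f_score, er
```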
We applied batch normalization and dropout in all cases, and binary cross-entropy was employed as
the loss function, which acts as the criterion for the convergence of the weights. The Adam optimizer was
used to optimize the neural networks. As seen in Table 2, a single intermediate learning rate achieved
the optimal CRNN performance. Generally, as the optimal learning rate for neural networks is not
pre-determined, and varies with both the network architecture and the amount of training data, we searched
for the optimal learning
rate by observing the performance on the validation data set. We can see in Table 2 that the best learning rate
for the testing data is the same as the learning rate that achieves the best performance on the validation data.
This indicates that it is reasonable to determine the optimal learning rate of the CRNN from its performance
on the validation data. Table 2 also shows that as the learning rate decreases, the number of epochs giving
the best performance increases. For example, the number of epochs is 33 at the optimal learning rate,
whereas it increases to 191 at the smallest learning rate tested. The larger number of epochs indicates a slower
convergence of the CRNN parameters, which causes performance degradation due to underfitting of
the neural networks. In contrast, when the learning rate increases beyond the optimum, the number of epochs
decreases dramatically to 16, which causes overfitting and results in performance degradation.
To investigate the performance variation with the learning rate further, Figure 3 shows the variation of
the loss function and accuracy at the output of the CRNN during training as the learning rate was varied over
the values in Table 2. At the optimal learning rate, the loss function on the validation data reaches its
minimum at approximately 30 epochs (precisely, 33) and subsequently fluctuates without ever falling below
that minimum. However, for the training data, the loss function decreases from the beginning to the end of
the training (we set the maximum number of epochs to 200). As it is important that the networks not be
overfitted, we stopped the iteration at 33 epochs using the early stopping algorithm described below.
Meanwhile, we see different characteristics at the next smaller learning rate: the loss function on
the validation data decreases for longer and reaches its minimum at 157 epochs. The longer training decreases
the performance of the CRNN on both validation and test data due to underfitting. This phenomenon becomes
more pronounced as the learning rate decreases further; at the smallest rate tested, the loss function does not
reach its minimum before the end of the training. A similar trend is observed when we monitor the accuracy
instead of the loss function.
We investigated the effect on performance of changing the input batch data at every epoch of
the training. The starting points of the batch data were shifted at each epoch, making the input segments at
successive epochs differ by the shift length; a sketch of this scheme is given below. For the evaluation,
both segment-based and event-based methods were used, and the results are reported as F-score and ER.
We used the ER as the convergence criterion for the training. Early stopping was employed: we stopped
the training when the convergence criterion did not improve for more than 100 epochs on the validation data.
This was to prevent overfitting, and the maximum number of epochs was set to 200. We also investigated
the effects of batch normalization (BN) and dropout. BN has been used to mitigate the vanishing and
exploding gradient problems in the backpropagation algorithm, and we applied BN before the ReLU
activation function in the convolution layers. Dropout is a popular regularization method for neural networks,
which excludes neurons from training with a predefined probability (the dropout rate). In this study, dropout
is used in all layers of the CNN and RNN (but not the FNN) [18].
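A minimal sketch of the batch-shift scheme follows, assuming the shift accumulates over epochs and wraps modulo the segment length; the paper does not give the shift length, so the value in the usage comment is illustrative.

```python
import numpy as np

def shifted_segments(features, labels, seg_len, epoch, shift):
    """Cut (frames, feat) data into non-overlapping seg_len segments,
    with the starting point advanced by `epoch * shift` frames
    (assumption: wrapped modulo seg_len)."""
    offset = (epoch * shift) % seg_len
    n_frames = features.shape[0]
    starts = range(offset, n_frames - seg_len + 1, seg_len)
    X = np.stack([features[s:s + seg_len] for s in starts])
    Y = np.stack([labels[s:s + seg_len] for s in starts])
    return X, Y

# Illustrative training loop with early stopping on the validation ER
# (patience 100, at most 200 epochs), as described above:
# for epoch in range(200):
#     X, Y = shifted_segments(train_feats, train_labels, seg_len=1024,
#                             epoch=epoch, shift=256)  # shift=256 is illustrative
#     model.fit(X, Y, epochs=1, verbose=0)
#     ...compute the validation ER and stop once it has not improved
#     for 100 consecutive epochs.
```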
Table 3 presents the results of using the shift of batch data. In the segment-based evaluation,
we see an improved average F-score/ER of 63.18%/0.52 using the shift, compared to 62.11%/0.54 without
the shift (non-shift). We also observe a slight performance improvement in the event-based evaluation.
From these results, we confirm that superior performance can be expected from the batch-data shift when
training the CRNN.
Table 3 also shows the effects of BN and dropout on performance. As expected, the best average F-score/ER
is obtained when applying both BN and dropout (56.67%/0.71). If we apply only BN without dropout, we see
a slight performance degradation (54.54%/0.79), which implies that the effect of dropout on performance
is not significant. Meanwhile, if we do not apply BN, the performance degrades significantly (regardless of
whether we apply dropout), and the poorest result is obtained when we apply neither BN nor dropout
(47.15%/0.81). In Table 4, we compare the performance of the CRNN between two convergence criteria
(ER and F-score) for training. BN and dropout were applied and the batch-shift method was used. In the table,
we observe a slight performance improvement with ER, but the difference appears negligible, so we conclude
that either ER or F-score can be used as the convergence criterion without significantly affecting
the performance.
Table 1. Sound classes and their total durations in seconds in the TUT-SED Synthetic 2016 database

Classes    Duration (s)    Classes        Duration (s)
Glass      621             Motor cycle    3691
Gun        534             Foot steps     1173
Cat        941             Claps          3278
Dog        716             Bus            3464
Thunder    3007            Mixer          4020
Bird       2298            Crowd          4825
Horse      1614            Alarm          4405
Crying     2007            Rain           3975
Table 2. Performance of CRNN as the learning rate changes

Learning rate    Validation data (F-score/ER)         Testing data (F-score/ER)          Epoch
                 Segment        Event                 Segment        Event
                 61.69%/0.52    37.69%/0.96           60.61%/0.53    37.05%/0.97        16
                 68.75%/0.45    43.49%/0.88           64.21%/0.50    40.50%/0.96        33
                 66.44%/0.49    39.10%/0.96           63.76%/0.52    36.48%/1.04        157
                 44.16%/0.69    9.83%/1.24            43.38%/0.71    10.82%/1.27        191
Figure 3. Variation of loss function and accuracy as learning rate changes
Table 3. Performance of CRNN with various training conditions

BN     Dropout    Segment-based (F-score/ER)        Event-based (F-score/ER)          Average
                  Shift          Non-shift          Shift          Non-shift
Yes    No         66.10%/0.48    65.28%/0.54        43.80%/1.01    42.99%/1.12        54.54%/0.79
Yes    Yes        67.24%/0.49    66.62%/0.49        45.93%/0.94    46.88%/0.92        56.67%/0.71
No     No         58.58%/0.57    58.88%/0.58        35.47%/1.06    35.68%/1.04        47.15%/0.81
No     Yes        60.80%/0.54    57.67%/0.56        40.02%/0.97    39.18%/1.00        49.4%/0.77
Avg.              63.18%/0.52    62.11%/0.54        41.3%/1.0      41.18%/1.02
Table 4. Performance of CRNN depending on the convergence criterion

Convergence criterion    Segment-based (F-score/ER)    Event-based (F-score/ER)    Average
ER                       67.24%/0.49                   45.93%/0.94                 56.59%/0.72
F-score                  66.45%/0.51                   44.97%/0.98                 55.71%/0.75
Finally, Table 5 shows the performance variation as we changed the length of the input segment of
the CRNN. Compared to the shorter lengths (2.56 s, 5.12 s), the longer segments (10.24 s, 20.48 s) show better
performance. This may be because many of the acoustic events in TUT-SED Synthetic have long
durations, and the RNN can efficiently capture the time correlations in the long segments.
Table 5. Performance of CRNN depending on the length of the input segment

Segment length (s)    Segment-based (F-score/ER)    Event-based (F-score/ER)    Average
2.56                  66.84%/0.51                   42.75%/1.13                 54.80%/0.82
5.12                  67.47%/0.50                   44.27%/1.04                 55.87%/0.77
10.24                 68.01%/0.47                   45.33%/0.97                 56.67%/0.72
20.48                 67.24%/0.49                   45.93%/0.94                 56.54%/0.70
4. CONCLUSION
For acoustic event detection, approaches based on deep neural networks have demonstrated
superior performance to conventional machine learning methods such as GMM and SVM. Among them,
CRNN is thought to be well suited for acoustic event detection due to its ability to handle signal distortions in
the time-frequency domain as well as to exploit the temporal-correlation information of the audio signal.
In this study, we employed a CRNN as the classifier for acoustic event detection, and several training
conditions were tested through extensive experiments to determine its optimal hyper-parameters.
In the experiments, by varying the learning rate, we identified the value at which the optimum
performance is obtained. From the results, we could also see that the learning rate that exhibits
optimum performance on the validation data also performs best on the testing data. This suggests that it
is reasonable to determine the optimal learning rate based on performance tests on the validation data.
We further confirmed that BN and dropout contributed to improving the performance of the CRNN;
in particular, BN had a larger impact on the performance improvement than dropout.
Instead of using identical batch data at every iteration, we obtained improved performance by
changing the batch data in every iteration, which effectively increases the number of training samples for
the CRNN. We further found that the length of the input segments of the CRNN also affects the performance:
we obtained better performance using longer segments, as the acoustic events used in this paper had
relatively long durations.
ACKNOWLEDGEMENTS
This research was supported by Basic Science Research Program through the National Research
Foundation of Korea (NRF) funded by the Ministry of Education (No. 2018R1A2B6009328).
REFERENCES
[1] M. K. Nandwana, et al., “Robust Unsupervised Detection of Human Screams In Noisy Acoustic Environments,”
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD,
pp. 161-165, 2015.
[2] M. Crocco, et al., “Audio Surveillance: A Systematic Review,” ACM Computing Surveys, no. 52, 2016.
[3] J. Salamon and J. P. Bello, “Feature Learning with Deep Scattering for Urban Sound Analysis,” 2015 23rd
European Signal Processing Conference (EUSIPCO), Nice, pp. 724-728, 2015.
[4] S. Ntalampiras, et al., “On Acoustic Surveillance of Hazardous Situations,” 2009 IEEE International Conference
on Acoustics, Speech and Signal Processing, Taipei, pp. 165-168, 2009.
[5] Y. Wang, et al., “Audio-based Multimedia Event Detection Using Deep Recurrent Neural Networks,” 2016 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, pp. 2742-2746, 2016.
[6] D. Stowell and D. Clayton, “Acoustic Event Detection for Multiple Overlapping Similar Sources,” 2015 IEEE
Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, pp. 1-5, 2015.
[7] G. Dekkers, et al., “DCASE 2018 Challenge - Task 5: Monitoring of domestic activities based on multi-channel
acoustics,” KU Leuven, Tech. Rep., 2018.
[8] D. Stowell, et al., “Automatic Acoustic Detection of Birds through Deep Learning: the First Bird Audio Detection
Challenge,” Methods in Ecology and Evolution, vol. 10, no. 3, pp. 1-21, 2018.
[9] F. Briggs, et al., “Acoustic Classification of Multiple Simultaneous Bird Species: A Multi-instance Multi-Label
Approach,” The Journal of the Acoustical Society of America, vol. 131, no. 6, pp. 4640-4640, 2012.
[10] A. Krizhevsky, et al., “Imagenet Classification with Deep Convolutional Neural Networks,” Advances in Neural
Information Processing Systems, vol. 25, no. 2, pp. 1097-1105, 2012.
[11] K. He, et al., “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, pp. 770-778, 2016.
[12] Olga Russakovsky, et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of
Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[13] A. Graves, et al., “Speech Recognition with Deep Recurrent Neural Networks,” 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, pp. 6645-6649, 2013.
[14] J. Chorowski, et al., “Attention-Based Models for Speech Recognition,” in Proceedings of the 28th International
Conference on Neural Information Processing Systems (NIPS), vol. 1, pp. 577-585, 2015.
[15] A. Hannun, et al., “Deep Speech: Scaling up End-to-End Speech Recognition,” arXiv: 1412.5567, 2014.
[16] K. Cho, et al., “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation,”
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1724-1734, 2014.
[17] I. Sutskever, et al., “Sequence to Sequence Learning with Neural Networks,” in Proceedings of the 27th
International Conference on Neural Information Processing Systems (NIPS), pp. 3104-3112, 2014.
[18] S. H. Chung and Y. J. Chung, “Comparison of Audio Event Detection Performance using DNN,” Journal of the
Korea Institute of Electronic Communication Sciences, vol. 13, no. 3, pp. 571-577, 2018.
[19] E. Cakir, et al., “Polyphonic sound event detection using multilabel deep neural networks,” 2015 International
Joint Conference on Neural Networks (IJCNN), Killarney, pp. 1-7, 2015.
[20] Y. LeCun, et al., “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278-2324, 1998.
[21] E. Cakir, et al., “Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection,” in IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291-1303, 2017.
[22] T. N. Sainath, et al., “Convolutional, Long Short-term Memory, Fully Connected Deep Neural Networks,” 2015
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD,
pp. 4580-4584, 2015.
[23] K. Choi, et al., “Convolutional Recurrent Neural Networks for Music Classification,” 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, pp. 2392-2396, 2017.
[24] “TUT-SED Synthetic,” 2016. [Online]. Available at: http://www.cs.tut.fi/sgn/arg/taslp2017-crnn-sed/tut-sed-synthetic-2016
[25] A. Mesaros, et al., “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6,
pp. 162-178, 2016.
BIOGRAPHIES OF AUTHORS
Suk-Hwan Jung received his B.Sc. and M.Sc. degrees in Electronics Engineering from Keimyung
University, Daegu, South Korea, in 2016 and 2018, respectively. He has been with Samju
Electronics Co. since March 2018. His main research interests are audio event detection in
noisy environments and deep learning for artificial intelligence.

Yong-Joo Chung received his B.Sc. degree in Electronics Engineering from Seoul National
University, Seoul, South Korea, in 1988. He earned his M.Sc. and Ph.D. degrees in Electrical and
Electronics Engineering from the Korea Advanced Institute of Science and Technology (KAIST),
Daejeon, South Korea, in 1995. He is currently a Professor with the Department of Electronics
Engineering at Keimyung University, Daegu, South Korea. His research interests are in the areas
of speech recognition, audio event detection, machine learning, and pattern recognition.
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 

Recently uploaded (20)

Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 

Performance analysis of the convolutional recurrent neural network on acoustic event detection

  • 1. Bulletin of Electrical Engineering and Informatics Vol. 9, No. 4, August 2020, pp. 1387~1393 ISSN: 2302-9285, DOI: 10.11591/eei.v9i4.2230  1387 Journal homepage: http://beei.org Performance analysis of the convolutional recurrent neural network on acoustic event detection Suk-Hwan Jung, Yong-Joo Chung Department of Electronics Engineering, Keimyung University, South Korea Article Info ABSTRACT Article history: Received Nov 11, 2019 Revised Feb 3, 2020 Accepted Mar 23, 2020 In this study, we attempted to find the optimal hyper-parameters of the convolutional recurrent neural network (CRNN) by investigating its performance on acoustic event detection. Important hyper-parameters such as the input segment length, learning rate, and criterion for the convergence test, were determined experimentally. Additionally, the effects of batch normalization and dropout on the performance were measured experimentally to obtain their optimal combination. Further, we studied the effects of varying the batch data on every iteration during the training. From the experimental results using the TUT sound events synthetic 2016 database, we obtained optimal performance with a learning rate of . We found that a longer input segment length aided performance improvement, and batch normalization was far more effective than dropout. Finally, performance improvement was clearly observed by varying the starting points of the batch data for each iteration during the training. Keywords: Acoustic event detection Convolutional neural network Convolutional recurrent neural network Hyper-parameters Recurrent neural network This is an open access article under the CC BY-SA license. Corresponding Author: Yong-Joo Chung, Department of Electronics Engineering, Keimyung University, 1095 Dalgubul-Daero, Dalseo-Gu, Daegu, South Korea. Email: yjjung@kuu.ac.kr 1. INTRODUCTION Recently, there has been increasing research interests in acoustic event detection, where the existence and occurrence times of the various sounds in our daily lives are identified. There are many applications for acoustic event detecton, including surveillance [1, 2], urban sound analysis [3, 4], information retrieval from multimedia contents [5], health care monitoring [6-7] and bird call detection [8, 9]. Deep neural networks (DNNs) have demonstrated superior performance to conventional machine learning techniques in image classification [10-12], speech recognition [13-15], and machine translation [16, 17]. In [18], we see that the feedforward neural network (FNN) now outperforms the Gaussian mixture model (GMM) and support vector machine (SVM), which have traditionally been employed for acoustic event detection. FNN has also been shown to outperform the conventional GMM-HMM-based methods in polyphonic acoustic event detection [19]. Therefore, we can say that current studies on acoustic event detection mainly focuse on DNN-based approaches. However, due to the fixed connection between the input and hidden layers, FNN is apparently inadequate to overcome the signal distortions in image classification. Similarly, it is apparent that FNN is also insufficient for acoustic event detection as audio signal distortions are frequently encountered in the 2-dimensional time-frequency domain of the signal. Another problem with FNN lies in modeling the correlation between the time-frames of the audio signal. 
As FNN only concatenates several input frames together to model the time correlation, it often fails to model the long-term time correlations of the audio signal.
Convolutional neural networks (CNNs) can alleviate this limitation of FNN by using 2-dimensional filters whose parameters are shared across time and frequency shifts [20], and they have exhibited superior performance to FNN in various pattern recognition tasks. In particular, due to its structural characteristics, CNN can efficiently handle image distortions in 2-dimensional space [10]. Similarly, we expect that the time-frequency domain distortions occurring in the audio signal can be accurately modeled by CNN. Nevertheless, CNN is inefficient at modeling long-term time correlations between audio signal samples.

Recurrent neural networks (RNNs) have been used successfully in speech recognition [13] and are superior to other neural networks in modeling the time-correlation information of time-series signals such as speech and audio. However, because RNN is unable to tackle 2-dimensional distortions in the time-frequency domain of the audio signal, its performance is usually inferior to CNN when used alone in acoustic event detection.

Recently, there have been approaches that combine CNN and RNN for their joint merits. Among them, convolutional recurrent neural networks (CRNNs) have been used successfully for acoustic event detection [21], speech recognition [22], and music classification [23]. As CRNN is constructed by connecting CNN, RNN, and FNN in series, it is more complex than other neural networks; therefore, the combined effect of the networks is difficult to predict. Moreover, as the use of CRNN for acoustic event detection is in its early stages, there are few studies on optimizing its various hyper-parameters. Therefore, in this study, we attempted to find the optimal hyper-parameters of CRNN for acoustic event detection. Several experiments were performed to identify the optimal hyper-parameters, and we used the test results on the validation data to determine them. Important hyper-parameters, such as the input segment length, learning rate, and criterion for the convergence test, were determined from the experiments. Additionally, the effects of batch normalization and dropout on the performance were observed. We also studied the effects of varying the batch data in every iteration during the training.

This paper is organized as follows. In Section 2, we introduce the feature extraction method for audio signals as well as the architecture of the CRNN used for acoustic event detection. In Section 3, we present and discuss various experimental results, and conclusions are given in Section 4.

2. FEATURE EXTRACTION AND CRNN ARCHITECTURE
2.1. Feature extraction
In this study, we used log-mel filterbank (LMFB) outputs as the input features for CRNN; the entire feature extraction process is shown in Figure 1. We first computed the short-time Fourier transform (STFT) of 40-ms audio frames sampled at 44.1 kHz, with STFTs computed every 20 ms, giving 50% overlap [21]. Then, 40-dimensional mel filterbank outputs were extracted from the STFTs spanning 0 to 22050 Hz and log-transformed to obtain the LMFBs, which were normalized by subtracting the mean and dividing by the standard deviation of the entire training data.

Figure 1. Extraction process of the log-mel filterbank (LMFB)
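To make the pipeline concrete, the following is a minimal sketch of the LMFB extraction described above; the use of librosa and the small epsilon added before the logarithm are our assumptions, as the paper does not name its tool chain.

```python
import numpy as np
import librosa

SR = 44100                    # sampling rate given in the paper
N_FFT = int(0.040 * SR)       # 40-ms analysis window
HOP = int(0.020 * SR)         # 20-ms hop, i.e. 50% overlap
N_MELS = 40                   # 40 mel bands spanning 0-22050 Hz

def extract_lmfb(wav_path):
    """Return a (frames, 40) matrix of log-mel filterbank outputs."""
    y, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP,
        n_mels=N_MELS, fmin=0, fmax=SR // 2)
    return np.log(mel + 1e-10).T   # epsilon avoids log(0); an assumption

def normalize(features, mean, std):
    """mean/std are computed once over the entire training set."""
    return (features - mean) / std
```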
2.2. The architecture of CRNN
Figure 2 presents the architecture of the CRNN used in this paper. The CRNN consists of a CNN followed by an RNN and an FNN in sequence. The CNN acts as an audio feature extractor that is robust against distortions in the time-frequency domain. The RNN utilizes the time-correlation information of the audio signals. Finally, the FNN serves as an output layer that produces the posterior probabilities of each sound class at each time frame. As the CNN takes 40-dimensional LMFBs as input features, the dimension of its input is T×40, where T is the length of the input segment. The CNN consists of 3 convolution layers, each of which has 256 feature maps with 5×5 filters. The filter outputs are processed by batch normalization and then passed through a ReLU activation function. To preserve the time-domain resolution, max pooling is performed only along the frequency axis. Dropout is applied at a rate of 0.25 after each max pooling layer [10]. The input segment length T is set to 1024 frames (20.48 s); we experimented with different values of T to find the one that produced the best result on the validation dataset.
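As an illustration, here is a minimal Keras sketch of this convolutional front-end; the framework choice and the frequency pooling size are our assumptions, since the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_front_end(x):
    """Three Conv-BN-ReLU blocks with frequency-only pooling, as described above."""
    for _ in range(3):
        x = layers.Conv2D(256, (5, 5), padding='same')(x)  # 256 maps, 5x5 filters
        x = layers.BatchNormalization()(x)                  # BN before the ReLU
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)        # pool frequency only,
                                                            # keep time resolution
        x = layers.Dropout(0.25)(x)                         # dropout after pooling
    return x
```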
The output of the CNN is input to the RNN, which consists of 256 gated recurrent units (GRUs). The output layer consists of K units with sigmoid activation functions, where K is the number of acoustic event classes. The sigmoid activation function produces the posterior probability of each class at each time frame, from which we decide whether an acoustic event is active based on a threshold (0.5).

Figure 2. Architecture of the CRNN used for acoustic event detection
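Continuing the sketch above, the recurrent layer and per-frame sigmoid outputs can be added as follows; the way the frequency and channel axes are flattened is an assumption, while the Adam optimizer and binary cross entropy loss are stated later in the paper.

```python
def build_crnn(T=1024, n_mels=40, K=16):
    """Assemble the CRNN: conv front-end, 256-unit GRU, K sigmoid outputs."""
    inp = layers.Input(shape=(T, n_mels, 1))        # T x 40 LMFB segment
    x = conv_front_end(inp)
    x = layers.Reshape((T, -1))(x)                  # merge frequency/channel axes
    x = layers.GRU(256, return_sequences=True)(x)   # 256 gated recurrent units
    out = layers.TimeDistributed(
        layers.Dense(K, activation='sigmoid'))(x)   # K class posteriors per frame
    model = tf.keras.Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# An event is declared active wherever its posterior exceeds the 0.5 threshold:
# activations = model.predict(segments) > 0.5
```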
3. EXPERIMENTAL RESULTS
3.1. Database and evaluation metric
To evaluate the performance of CRNN on acoustic event detection, we used the TUT sound events synthetic 2016 (TUT-SED Synthetic) database, which is popularly used in this area [24]. TUT-SED Synthetic contains artificially generated audio data, because it is difficult to obtain enough data using only audio recorded in real environments; moreover, subjective labelling errors can be mitigated by artificial data. TUT-SED Synthetic was generated by mixing isolated sound events from 16 different classes. The total length of the data is 566 minutes, divided into training, testing, and validation data in proportions of 60%, 20%, and 20%, respectively. Segments of length 3-15 seconds were used for training, testing, and validation, and there were no common acoustic event instances between them. The sound classes and their total durations in the database are shown in Table 1.

For the evaluation metric of acoustic event detection, we used both the error rate (ER) and the F-score [25]. We adopted two types of evaluation methods: segment-based and event-based. In the segment-based method, the binarized outputs of the CRNN are compared with the ground truth in every segment of length 1 s. In the event-based method, the outputs of the CRNN are compared with the labels in the ground truth whenever an acoustic event has been detected by the CRNN [25].

We sought the optimal hyper-parameters of the CRNN by applying various conditions during the training. To find the optimal learning rate, we experimented with varying learning rates; the results are shown in Table 2. Batch normalization and dropout were applied in all cases, and binary cross entropy was employed as the loss function, which acts as the criterion for the convergence of the weights. The Adam optimizer was used to optimize the neural networks. As seen in Table 2, the optimal CRNN performance was achieved at the second-largest of the learning rates tested. Generally, as the optimal learning rate for a neural network is not pre-determined and varies with both the network architecture and the amount of training data, we searched for the optimal learning rate by observing the performance on the validation data set. We can see in Table 2 that the best learning rate for the testing data is the same as the learning rate that achieves the best performance on the validation data. This indicates that it is reasonable to determine the optimal learning rate of CRNN from its performance on the validation data.

Table 2 also shows that as the learning rate decreases, the number of epochs at which the best performance occurs increases. For example, the number of epochs is 33 at the optimal learning rate, whereas it increases to 191 at the smallest learning rate tested. A larger number of epochs indicates slower convergence of the CRNN parameters, which causes performance degradation due to underfitting of the neural networks. In contrast, at the largest learning rate tested, the number of epochs decreases dramatically to 16, which causes overfitting and results in performance degradation.

To investigate the performance variation with the learning rate further, Figure 3 shows the variation of the loss function and accuracy at the output of the CRNN during training across the range of learning rates tested. At the optimal learning rate, the loss function on the validation data reaches its minimum at approximately 30 epochs (33, precisely) and subsequently fluctuates (but never falls below the minimum). However, for the training data, the loss function decreases from the beginning to the end of the training (we set the maximum number of epochs to 200). As it is important for the networks not to be overfitted, we stopped the iteration at 33 epochs by the early stopping algorithm described below. Meanwhile, we see different characteristics at the next-smaller learning rate: the loss function on the validation data decreases for longer and reaches its minimum at epoch 157. The longer iterations decrease the performance of the CRNN on both the validation and test data due to underfitting. This phenomenon becomes more pronounced as we further decrease the learning rate; at the smallest rate tested, the loss function does not reach its minimum by the end of the training. A similar trend is observed when we monitor the accuracy instead of the loss function.

We investigated the effect on performance of changing the input batch data at every epoch of the training. The starting points of the batch data were shifted at each epoch, making the input segments at successive epochs differ by the shift length (a minimal sketch of this scheme appears below). For the evaluation, both segment-based and event-based methods were used, and the results are reported as F-score and ER. We used ER as the convergence criterion for the training. Further, early stopping was employed: training was stopped when the convergence criterion did not improve for more than 100 epochs on the validation data. This was to prevent overfitting, and the maximum number of epochs was set to 200.

We also investigated the effects of batch normalization (BN) and dropout. BN has been used to mitigate the vanishing and exploding gradient problems in the backpropagation algorithm, and we applied BN before the ReLU activation function in the convolution layers. Dropout is a popular regularization method for neural networks, which excludes neurons from training with a predefined probability (the dropout rate). In this study, dropout was used in all layers of the CNN and RNN (but not the FNN) [18].
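As a concrete illustration of the batch-data shift, the sketch below re-segments the training features with an epoch-dependent offset; the shift length and the segmentation details are our assumptions, as the paper does not state them.

```python
import numpy as np

def shifted_segments(features, T, epoch, shift):
    """Yield T-frame segments whose starting point moves by `shift` frames
    each epoch, so successive epochs see differently aligned batch data."""
    start = (epoch * shift) % T
    for i in range((len(features) - start) // T):
        yield features[start + i * T : start + (i + 1) * T]

# usage: segments for epoch e, with a hypothetical 128-frame shift
# batch = np.stack(list(shifted_segments(train_lmfb, T=1024, epoch=e, shift=128)))
```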
In Table 3, the results of using the shift of the batch data are presented. In the segment-based evaluation, we see an improved average F-score/ER of 63.18%/0.52 with the shift, compared to 62.11%/0.54 without it (non-shift). We also observe a slight performance improvement in the event-based evaluation. From these results, we confirm that superior performance can be expected from the batch-data shift when training the CRNN.

Table 3 also shows the effects of BN and dropout on performance. As expected, the best average F-score/ER is obtained when both BN and dropout are applied (56.67%/0.71). If we apply only BN without dropout, we see a slight performance degradation (54.54%/0.79), which implies that the effect of dropout on performance is not significant. Meanwhile, if we do not apply BN, the performance degrades significantly (regardless of whether we apply dropout), and the poorest result is obtained when we apply neither BN nor dropout (47.15%/0.81).

In Table 4, we compare the performance of the CRNN between two convergence criteria (ER and F-score) for training. BN and dropout were applied, and the batch-data shift was used. In the table, we observe a slight performance improvement with ER, but the difference appears negligible; we conclude that either ER or the F-score can be used as the convergence criterion without significantly affecting performance.

Table 1. Sound classes and their total durations in seconds in the TUT-SED Synthetic 2016 database

  Class     Duration (s)     Class         Duration (s)
  Glass          621         Motor cycle       3691
  Gun            534         Foot steps        1173
  Cat            941         Claps             3278
  Dog            716         Bus               3464
  Thunder       3007         Mixer             4020
  Bird          2298         Crowd             4825
  Horse         1614         Alarm             4405
  Crying        2007         Rain              3975
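For reference, the segment-based F-score and ER reported in these tables follow the definitions of [25]; the sketch below computes them from binarized outputs, assuming boolean arrays of shape (n_segments, n_classes) for predictions and ground truth.

```python
import numpy as np

def segment_based_metrics(pred, ref):
    """Segment-based F-score and error rate as defined in [25].
    pred, ref: boolean arrays of shape (n_segments, n_classes)."""
    tp = np.logical_and(pred, ref).sum(axis=1)     # true positives per segment
    fp = np.logical_and(pred, ~ref).sum(axis=1)    # false positives per segment
    fn = np.logical_and(~pred, ref).sum(axis=1)    # false negatives per segment
    f_score = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    s = np.minimum(fp, fn).sum()                   # substitutions
    d = np.maximum(0, fn - fp).sum()               # deletions
    i = np.maximum(0, fp - fn).sum()               # insertions
    er = (s + d + i) / ref.sum()                   # normalized by active refs
    return f_score, er
```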
Table 2. Performance of the CRNN as the learning rate changes (F-score/ER; rows ordered from the largest learning rate tested to the smallest)

  Learning rate     Validation, segment   Validation, event   Testing, segment   Testing, event   Epochs
  largest           61.69%/0.52           37.69%/0.96         60.61%/0.53        37.05%/0.97        16
  second largest    68.75%/0.45           43.49%/0.88         64.21%/0.50        40.50%/0.96        33
  second smallest   66.44%/0.49           39.10%/0.96         63.76%/0.52        36.48%/1.04       157
  smallest          44.16%/0.69            9.83%/1.24         43.38%/0.71        10.82%/1.27       191

Figure 3. Variation of the loss function and accuracy as the learning rate changes

Table 3. Performance of the CRNN under various training conditions (F-score/ER)

  BN    Dropout   Segment, shift   Segment, non-shift   Event, shift   Event, non-shift   Average
  Yes   No        66.10%/0.48      65.28%/0.54          43.80%/1.01    42.99%/1.12        54.54%/0.79
  Yes   Yes       67.24%/0.49      66.62%/0.49          45.93%/0.94    46.88%/0.92        56.67%/0.71
  No    No        58.58%/0.57      58.88%/0.58          35.47%/1.06    35.68%/1.04        47.15%/0.81
  No    Yes       60.80%/0.54      57.67%/0.56          40.02%/0.97    39.18%/1.00        49.40%/0.77
  Average         63.18%/0.52      62.11%/0.54          41.30%/1.00    41.18%/1.02

Table 4. Performance of the CRNN depending on the convergence criterion (F-score/ER)

  Convergence criterion   Segment-based   Event-based   Average
  ER                      67.24%/0.49     45.93%/0.94   56.59%/0.72
  F-score                 66.45%/0.51     44.97%/0.98   55.71%/0.75

Finally, Table 5 shows the performance variation as we changed the length of the input segment of the CRNN. Compared with shorter segments (2.56 s, 5.12 s), longer segments (10.24 s, 20.48 s) show better performance. This may be because many of the acoustic events in TUT-SED Synthetic are long, so the RNN can efficiently capture time correlations within long segments.

Table 5. Performance of the CRNN depending on the length of the input segment (F-score/ER)

  Segment length (s)   Segment-based   Event-based   Average
  2.56                 66.84%/0.51     42.75%/1.13   54.80%/0.82
  5.12                 67.47%/0.50     44.27%/1.04   55.87%/0.77
  10.24                68.01%/0.47     45.33%/0.97   56.67%/0.72
  20.48                67.24%/0.49     45.93%/0.94   56.54%/0.70
4. CONCLUSION
For acoustic event detection, approaches based on deep neural networks have demonstrated superior performance to conventional machine learning methods such as GMM and SVM. Among them, CRNN is thought to be well suited to acoustic event detection due to its ability to reduce signal distortions in the time-frequency domain as well as to exploit the temporal-correlation information of the audio signal. In this study, we employed CRNN as the classifier for acoustic event detection, and extensive experiments were performed to determine its optimal hyper-parameters. By varying the learning rate, we found that optimum performance is obtained at the second-largest of the learning rates tested. From the results, we could also see that the learning rate that exhibits optimum performance on the validation data also performs best on the testing data. This suggests that it is reasonable to determine the optimal learning rate based on performance tests on the validation data. We further confirmed that BN and dropout contributed to improving the performance of CRNN; in particular, BN had a larger impact on the performance improvement than dropout. Instead of using identical batch data at every iteration, we obtained improved performance by changing the batch data in every iteration, which results from increasing the effective number of training samples for CRNN. Finally, we found that the length of the input segments of the CRNN also affects performance: we obtained better performance with longer segments, as the acoustic events used in this paper have relatively long durations.

ACKNOWLEDGEMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (No. 2018R1A2B6009328).

REFERENCES
[1] M. K. Nandwana, et al., "Robust Unsupervised Detection of Human Screams in Noisy Acoustic Environments," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, pp. 161-165, 2015.
[2] M. Crocco, et al., "Audio Surveillance: A Systematic Review," ACM Computing Surveys, no. 52, 2016.
[3] J. Salamon and J. P. Bello, "Feature Learning with Deep Scattering for Urban Sound Analysis," 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, pp. 724-728, 2015.
[4] S. Ntalampiras, et al., "On Acoustic Surveillance of Hazardous Situations," 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, pp. 165-168, 2009.
[5] Y. Wang, et al., "Audio-based Multimedia Event Detection Using Deep Recurrent Neural Networks," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, pp. 2742-2746, 2016.
[6] D. Stowell and D. Clayton, "Acoustic Event Detection for Multiple Overlapping Similar Sources," 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, pp. 1-5, 2015.
[7] G. Dekkers, et al., "DCASE 2018 Challenge - Task 5: Monitoring of Domestic Activities Based on Multi-Channel Acoustics," KU Leuven, Tech. Rep., 2018.
[8] D. Stowell, et al., "Automatic Acoustic Detection of Birds through Deep Learning: The First Bird Audio Detection Challenge," Methods in Ecology and Evolution, vol. 10, no. 3, pp. 1-21, 2018.
[9] F. Briggs, et al., "Acoustic Classification of Multiple Simultaneous Bird Species: A Multi-Instance Multi-Label Approach," The Journal of the Acoustical Society of America, vol. 131, no. 6, pp. 4640-4650, 2012.
[10] A. Krizhevsky, et al., "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, vol. 25, no. 2, pp. 1097-1105, 2012.
[11] K. He, et al., "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770-778, 2016.
[12] O. Russakovsky, et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
[13] A. Graves, et al., "Speech Recognition with Deep Recurrent Neural Networks," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, pp. 6645-6649, 2013.
[14] J. Chorowski, et al., "Attention-Based Models for Speech Recognition," Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), vol. 1, pp. 577-585, 2015.
[15] A. Hannun, et al., "Deep Speech: Scaling up End-to-End Speech Recognition," arXiv:1412.5567, 2014.
[16] K. Cho, et al., "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734, 2014.
[17] I. Sutskever, et al., "Sequence to Sequence Learning with Neural Networks," Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), pp. 3104-3112, 2014.
[18] S. H. Chung and Y. J. Chung, "Comparison of Audio Event Detection Performance using DNN," Journal of the Korea Institute of Electronic Communication Sciences, vol. 13, no. 3, pp. 571-577, 2018.
[19] E. Cakir, et al., "Polyphonic Sound Event Detection Using Multilabel Deep Neural Networks," 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, pp. 1-7, 2015.
[20] Y. LeCun, et al., "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[21] E. Cakir, et al., "Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291-1303, 2017.
[22] T. N. Sainath, et al., "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, pp. 4580-4584, 2015.
[23] K. Choi, et al., "Convolutional Recurrent Neural Networks for Music Classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, pp. 2392-2396, 2017.
[24] "TUT-SED Synthetic," 2016. [Online]. Available: http://www.cs.tut.fi/sgn/arg/taslp2017-crnn-sed/tut-sed-synthetic-2016
[25] A. Mesaros, et al., "Metrics for Polyphonic Sound Event Detection," Applied Sciences, vol. 6, no. 6, pp. 162-178, 2016.

BIOGRAPHIES OF AUTHORS

Suk-Hwan Jung received his B.Sc. and M.Sc. degrees in Electronics Engineering from Keimyung University, Daegu, South Korea in 2016 and 2018, respectively. He has been with Samju Electronics Co. since March 2018. His main research interests are audio event detection under noisy environments and deep learning for artificial intelligence.

Yong-Joo Chung received his B.Sc. degree in Electronics Engineering from Seoul National University, Seoul, South Korea in 1988. He earned his M.Sc. and Ph.D. degrees in Electrical and Electronics Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea in 1995. He is currently a Professor with the Department of Electronics Engineering at Keimyung University, Daegu, South Korea. His research interests are in the areas of speech recognition, audio event detection, machine learning, and pattern recognition.