This document discusses a proposed Recurrent-Convolutional Encoder-Decoder (R-CED) network for speech enhancement. The R-CED network aims to overcome the limitations of existing methods by estimating the a priori and a posteriori signal-to-noise ratios to separate noise from speech. The network consists of convolutional layers whose filter counts first increase and then decrease, encoding and then decoding the spectral features. Performance is evaluated with metrics such as PESQ, STOI, CER, MSE, SNR, and SDR. The proposed method aims to improve speech enhancement accuracy and recover higher speech quality than competing techniques.
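The encoder-decoder filter schedule described above can be sketched as a toy 1-D convolutional pipeline. The filter counts, kernel size, and input dimensions below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

def conv1d_relu(x, w):
    """Valid 1-D convolution followed by ReLU.
    x: (in_channels, time), w: (out_channels, in_channels, kernel)."""
    co, ci, k = w.shape
    t = x.shape[1] - k + 1
    out = np.zeros((co, t))
    for o in range(co):
        for i in range(t):
            out[o, i] = np.sum(w[o] * x[:, i:i + k])
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
# Encoder grows the filter count, decoder mirrors it back down (hypothetical schedule).
schedule = [12, 16, 24, 16, 12, 1]
x = rng.standard_normal((1, 64))  # one noisy spectral feature row
for n_filters in schedule:
    w = 0.1 * rng.standard_normal((n_filters, x.shape[0], 3))
    x = conv1d_relu(x, w)
# x is now a single enhanced feature row, shortened by the valid convolutions
```

Each valid convolution with kernel 3 trims two time steps, so six layers map a 64-step input to 52 steps with a single output channel.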
In present-day communications, speech signals are contaminated by various kinds of noise that degrade speech quality and adversely impact speech recognition performance. To overcome these issues, a novel speech-enhancement approach based on modified Wiener filtering is developed: the power spectrum of the degraded signal is computed to obtain the noise characteristics from the noisy spectrum. In the next phase, an MMSE technique is applied in which the Gaussian distribution of each signal, i.e. the original and the noisy signal, is analyzed. The Gaussian distribution provides the spectrum estimate and spectral-coefficient parameters used to formulate the probabilistic model. Moreover, a priori SNR computation is incorporated for coefficient updating and noise-presence estimation, operating similarly to a conventional VAD. However, the conventional VAD scheme relies on a hard threshold that cannot deliver satisfactory performance, so a soft-decision threshold is developed to improve the performance of speech enhancement. An extensive simulation study is carried out in MATLAB on the NOIZEUS speech database, and a comparative study shows the proposed approach outperforming the existing technique.
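The a priori SNR update and the resulting soft Wiener gain described above can be sketched with the classic decision-directed rule; the smoothing factor and the toy spectra below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, prev_clean_psd, alpha=0.98):
    """Decision-directed a priori SNR estimate feeding a Wiener gain.
    All inputs are per-bin power spectra of one frame."""
    post_snr = noisy_psd / noise_psd                      # a posteriori SNR
    prio_snr = (alpha * prev_clean_psd / noise_psd
                + (1.0 - alpha) * np.maximum(post_snr - 1.0, 0.0))
    return prio_snr / (1.0 + prio_snr)                    # soft gain in [0, 1)

noisy = np.array([4.0, 1.0, 9.0])         # toy noisy power spectrum
noise = np.array([1.0, 1.0, 1.0])         # toy noise estimate
prev_clean = np.array([2.0, 0.0, 8.0])    # clean estimate from previous frame
g = wiener_gain(noisy, noise, prev_clean)
enhanced_psd = g**2 * noisy               # attenuated power spectrum
```

Bins where the noisy power barely exceeds the noise estimate receive a gain near zero, which is exactly the soft-decision behavior the abstract contrasts with hard-threshold VAD.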
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Dialogue Act Modeling for Automatic Tagging and Recognition (Vipul Munot)
Aims to present a comprehensive framework for modelling and automatic classification of dialogue acts (DAs), founded on well-known statistical methods, and presents results obtained with this approach on a large, widely available corpus of spontaneous conversational speech.
LPC Models and Different Speech Enhancement Techniques - A Review (ijiert bestjournal)
The author has already published one review paper on enhancing the quality of a speech signal by minimizing noise; this is the second paper of the same series. Over the last two decades researchers have made continuous efforts to reduce noise in speech signals. This paper comments on the various studies and analysis proposals of researchers for enhancing speech quality. Models, coding, speech-quality improvement methods, speaker-dependent codebooks, autocorrelation subtraction, speech restoration, low-bit-rate speech production, compression, and enhancement are the various aspects of speech enhancement reviewed here; a forthcoming paper in the series will examine a few of these techniques to analyze the factors affecting them.
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI... (ijnlc)
Researchers in many nations have developed automatic speech recognition (ASR) to demonstrate national progress in information and communication technology for their languages. This work aims to improve ASR performance for the Myanmar language by varying Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. Thanks to its locality and pooling operations, a CNN can reduce spectral variation and model the spectral correlations present in the signal; the impact of these hyperparameters on CNN accuracy in ASR tasks is therefore investigated. A 42-hour data set is used for training, and ASR performance is evaluated on two open test sets: web news and recorded data. As Myanmar is a syllable-timed language, a syllable-based ASR system was built and compared with a word-based one. As a result, it achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
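The WER and SER figures above are edit-distance-based error rates; a minimal sketch (not the paper's actual scoring tool) is:

```python
def error_rate(ref, hyp):
    """Levenshtein edit distance between token lists, normalized by len(ref).
    Tokens are words for WER or syllables for SER."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

wer = error_rate("the cat sat".split(), "the cat sat down".split())  # one insertion
```

For syllable error rate the same function is applied to syllable tokens instead of word tokens.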
Eat It, Review It: A New Approach for Review Prediction (vivatechijri)
Deep learning has achieved significant improvements in various machine learning tasks. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks have been gaining popularity for text sequences, i.e. word prediction. The ability to abstract information from images or text is being widely adopted by organizations around the world; a basic task in deep learning is classification, whether of images or text. Techniques such as RNNs and CNNs have opened the door for data analysis, and emerging variants such as Region CNN and Recurrent CNN are under active development. The proposed system uses a Recurrent Neural Network for review prediction, combined with LSTM so that long sentences can be predicted. The system focuses on context-based review prediction and produces full-length sentences, helping users write proper reviews by understanding their context.
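The abstract names the RNN-plus-LSTM pipeline but does not specify it; purely as an illustration of the recurrent mechanism, a single Elman-style RNN step producing next-token probabilities (all dimensions and weights hypothetical) looks like:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, Wo):
    """One recurrent step: new hidden state and a softmax over the vocabulary."""
    h_new = np.tanh(Wx @ x + Wh @ h)
    logits = Wo @ h_new
    e = np.exp(logits - logits.max())
    return h_new, e / e.sum()

rng = np.random.default_rng(1)
emb, hid, vocab = 8, 16, 50          # hypothetical sizes
Wx = 0.1 * rng.standard_normal((hid, emb))
Wh = 0.1 * rng.standard_normal((hid, hid))
Wo = 0.1 * rng.standard_normal((vocab, hid))
h = np.zeros(hid)
for _ in range(3):                   # feed three dummy word embeddings
    h, probs = rnn_step(rng.standard_normal(emb), h, Wx, Wh, Wo)
```

An LSTM replaces the plain `tanh` update with gated cell-state updates, which is what lets the described system handle long sentences.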
Novel Approach of Implementing Psychoacoustic Model for MPEG-1 Audio (inventy)
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS (ijnlc)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signal, and HNM analysis and synthesis provide high-quality speech with fewer parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences: it locates an optimal match between them, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded; the recorded material was segmented manually and aligned at sentence, word, and phoneme level, and the Mahalanobis distance (MD) was computed between the aligned frames. The investigation showed better alignment in the HNM parametric domain and demonstrated that effective speech alignment can be carried out even at phrase level.
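The dynamic time warping described above can be sketched minimally as follows; a plain Euclidean frame distance stands in for the Mahalanobis distance the paper uses:

```python
import numpy as np

def dtw(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A repeated frame is absorbed at zero extra cost by the warping path.
d = dtw([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
```

The smaller the accumulated cost after warping, the better the alignment, which is exactly the improvement measure the abstract describes.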
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task such as driving a vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is widely used for aligning two given multidimensional sequences: it finds an optimal match between them, and since the distance between aligned sequences should be smaller than between unaligned ones, the improvement in alignment may be estimated from the corresponding distances. The technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate how much the alignment improves for sentence-based versus phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 male, 3 female); the material was segmented manually and aligned at sentence and phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM, the HNM parameters were further aligned at frame level using DTW, and Mahalanobis distances were computed for each sentence pair. The investigation showed more than a 20% reduction in the average Mahalanobis distance.
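The Mahalanobis distance used for the frame-level comparison above can be sketched directly; the covariance shown is an illustrative placeholder:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between two feature vectors under covariance `cov`."""
    d = np.asarray(x, float) - np.asarray(y, float)
    # solve(cov, d) avoids forming the explicit inverse of the covariance.
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# With an identity covariance the distance reduces to the Euclidean distance.
d = mahalanobis([1.0, 0.0], [0.0, 0.0], np.eye(2))
```

In practice the covariance would be estimated from the HNM parameter frames, so directions with high natural variability contribute less to the distance.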
Single Channel Speech Enhancement Using Wiener Filter and Compressive Sensing (IJECEIAES)
Speech enhancement algorithms are used to overcome several limiting factors in applications such as mobile phones and communication channels, where the central challenge is the trade-off between noise reduction and signal distortion in corrupted speech. A modified Wiener filter and compressive sensing (CS) are used to investigate and evaluate the improvement in speech quality. The method adapts the noise estimate and the Wiener-filter gain function to increase the weighted amplitude spectrum and better preserve the signal of interest. CS is then applied via the gradient projection for sparse reconstruction (GPSR) technique to empirically investigate the effects of the corrupting noise and to obtain perceptual improvements that reduce listener fatigue. In objective assessment tests the proposed algorithm outperforms conventional algorithms across several noise types at 0, 5, 10, and 15 dB SNR, achieving better speech quality and more effective noise reduction.
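GPSR itself is not reproduced here; as a stand-in, the closely related iterative soft-thresholding (ISTA) scheme for the same l1-regularized sparse reconstruction problem can be sketched as:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, y, lam=0.1, iters=100):
    """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft(x - step * A.T @ (A @ x - y), step * lam)
    return x

# Sanity check: with A = I the solution is the soft-thresholded measurement.
y = np.array([2.0, 0.05, -1.0])
x = ista(np.eye(3), y, lam=0.1)
```

GPSR solves the same objective via gradient projection on a bound-constrained reformulation; ISTA is used here only because it fits in a few lines.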
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp... (CSCJournals)
In this paper a new thresholding-based speech enhancement approach is presented, in which the threshold is determined statistically by applying the Teager energy operation to the Wavelet Packet (WP) coefficients of the noisy speech. The resulting threshold is applied to the WP coefficients with a hard thresholding function to obtain the enhanced speech. Detailed simulations with white, car, pink, and babble noise evaluate the performance of the proposed method. Standard objective measures, spectrogram representations, and subjective listening tests show that the method outperforms existing state-of-the-art thresholding-based speech enhancement approaches from high to low SNR levels.
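The discrete Teager energy operator and the hard-thresholding step described above can be sketched as follows; how the threshold is derived from the Teager energy here is a simplified assumption, not the paper's statistical rule:

```python
import numpy as np

def teager(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def hard_threshold(coeffs, thr):
    """Zero out wavelet-packet coefficients whose magnitude falls below thr."""
    coeffs = np.asarray(coeffs, float)
    return coeffs * (np.abs(coeffs) >= thr)

wp = np.array([0.9, -0.05, 0.4, 0.02, -0.7])   # toy WP coefficients
thr = np.sqrt(np.mean(np.abs(teager(wp))))     # illustrative threshold choice
denoised = hard_threshold(wp, thr)
```

For a pure sinusoid A*sin(w*n), the Teager energy is the constant A^2*sin^2(w), which is why it serves as a robust instantaneous-energy estimate for setting the threshold.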
Development of Algorithm for Voice Operated Switch for Digital Audio Control ... (IJMER)
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
Performance Calculation of Speech Synthesis Methods for Hindi Language (iosrjce)
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels
Bayesian Distance Metric Learning and Its Application in Automatic Speaker Re... (IJECEIAES)
This paper proposes a state-of-the-art Automatic Speaker Recognition (ASR) system based on Bayesian distance metric learning as a feature extractor. The modeling explores constraints on the distance between modified, simplified i-vector pairs from the same speaker and from different speakers. The distance metric is approximated as a weighted covariance matrix built from the leading eigenvectors of the covariance matrix, which is used to estimate the posterior distribution of the metric. Given a speaker label, the data pairs of different speakers with the highest cosine scores are selected to form a set of speaker constraints; this collection captures the most discriminative variability between speakers in the training data. The Bayesian distance learning approach achieves better performance than the most advanced methods, is insensitive to normalization compared with cosine scoring, and is very effective when training data are limited. The modified supervised i-vector-based system is evaluated on the NIST SRE 2008 database: the best combined cosine-score EER of 1.767% is obtained using LDA200 + NCA200 + LDA200, and the best Bayes_dml EER of 1.775% using LDA200 + NCA200 + LDA100. Bayes_dml surpasses the combined norm of cosine scores and is the best reported result for the short2-short3 condition of NIST SRE 2008.
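The cosine scoring of i-vector pairs referred to above is simply the normalized inner product; a sketch:

```python
import numpy as np

def cosine_score(w1, w2):
    """Cosine similarity between two i-vectors; higher suggests the same speaker."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

same = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors
diff = cosine_score([1.0, 0.0], [0.0, 1.0])             # orthogonal vectors
```

A verification trial thresholds this score; the EER quoted in the abstract is the operating point where false accepts and false rejects are equally frequent.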
Speech Emotion Recognition with Light Gradient Boosting Decision Trees Machine (IJECEIAES)
Speech emotion recognition aims to identify the emotion expressed in speech by analyzing the audio signal. In this work, data augmentation is first performed on the audio samples to increase their number for better model learning, and the samples are comprehensively encoded as frequency- and temporal-domain features. A light gradient boosting machine is used for classification, with hyperparameter tuning to determine the optimal settings. As speech emotion recognition datasets are imbalanced, class weights are set inversely proportional to the sample distribution, so minority classes receive higher weights. The experimental results show the proposed method outperforming state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB), 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS), and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
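The inverse-proportional class weighting mentioned above can be sketched as follows; the exact normalization the authors use is not stated, so the common n_samples / (n_classes * count) form is assumed:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * v) for c, v in counts.items()}

w = class_weights(["happy", "happy", "happy", "sad"])
# The minority class ("sad") receives the larger weight.
```

These weights would then scale each sample's contribution to the boosting loss, counteracting the imbalance described in the abstract.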
Incremental Difference as Feature for Lipreading (IDES Editor)
This paper presents a method of computing incremental difference features, based on scan-line projection and scan-converting lines, for the lipreading problem on a set of isolated word utterances. These features are affine invariants and are found to be effective in identifying similarity between utterances by the speaker in the spatial domain.
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK (ijitcs)
Speech technology is an emerging field, and automatic speech recognition has advanced in recent years. Much research has been performed for many foreign and regional languages, and multilingual speech processing is now attracting research attention. This paper proposes a methodology for developing a bilingual speech identification system for the Assamese and English languages based on an artificial neural network.
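The abstract does not specify the network architecture; purely as an illustration, a minimal two-class feedforward pass (random hypothetical weights over MFCC-style input features) could look like:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One hidden layer with tanh, softmax over the two language classes."""
    h = np.tanh(W1 @ x + b1)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(2)
n_feat, n_hidden = 13, 8                       # e.g. 13 MFCCs per frame (assumed)
W1 = 0.1 * rng.standard_normal((n_hidden, n_feat))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((2, n_hidden))  # classes: Assamese, English
b2 = np.zeros(2)
probs = forward(rng.standard_normal(n_feat), W1, b1, W2, b2)
```

A trained system would learn the weights from labeled bilingual speech and pick the language with the higher output probability.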
Audio Steganography Coding Using the Discreet Wavelet TransformsCSCJournals
The performance of audio steganography compression system using discreet wavelet transform (DWT) is investigated. Audio steganography coding is the technology of transforming stego-speech into efficiently encoded version that can be decoded in the receiver side to produce a close representation of the initial signal (non compressed). Experimental results prove the efficiency of the used compression technique since the compressed stego-speech are perceptually intelligible and indistinguishable from the equivalent initial signal, while being able to recover the initial stego-speech with slight degradation in the quality .
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Performance estimation based recurrent-convolutional encoder decoder for speech enhancement
International Journal of Advanced Science and Technology, Vol. 29, No. 05, (2020), pp. 772-777
ISSN: 2005-4238 IJAST
Copyright ⓒ 2020 SERSC
Performance Estimation Based Recurrent-Convolutional Encoder-Decoder for Speech Enhancement

A. Karthik¹, Dr. J. L. Mazher Iqbal²

¹ Research Scholar, Veltech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai
² Research Professor, Veltech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai
Abstract
Speech is the key to our communication skills. As we increasingly use recorded speech to communicate remotely with other human beings, we grow more and more accustomed to machines that simply "listen to us". The goal of speech enhancement is to improve the intelligibility and/or the overall perceptual quality of a degraded speech signal through audio signal processing techniques. Noise reduction is the most important branch of speech enhancement and is used in many applications, such as cell phones, VoIP, teleconferencing systems, speech recognition, and hearing aids. Speech enhancement is necessary in any application where a clean speech signal is required for further processing.
Keywords: Speech Recognition, Automatic Speech Recognition (ASR), Recurrent-Convolutional Encoder-Decoder (R-CED) network, PESQ, STOI, CER.
1. Introduction
Speech enhancement techniques focus primarily on removing noise from a speech signal; noise comes in various forms, and different techniques exist to suppress each of them. In recent years, learning architectures based on deep neural networks (DNNs) have been very successful in related areas such as speech recognition, and this success has motivated the study of DNNs for noise suppression in automatic speech recognition (ASR) and for speech enhancement. The central premise of using DNNs for speech enhancement is that the corruption of speech by noise is a complex process, so a complex nonlinear model such as a DNN is well suited to modelling it. Although comparatively little in-depth work has examined the usefulness of DNNs for speech enhancement, they have shown promising results and can outperform classical SE methods. A common aspect of many of these works is evaluation under matched, or seen, noise conditions: the noise types used at test time (e.g. background noise) are the same as those used in training. Unlike classical methods, which are motivated by signal-processing considerations, DNN-based methods are data-driven, and matched noise conditions may therefore not be ideal for evaluating DNNs for speech enhancement. Speech enhancement (SE) remains a serious research problem in audio signal processing. The goal is to improve the quality and intelligibility of speech signals corrupted by noise, given its applications in various sectors such as automatic speech recognition, mobile communication, and hearing aids.
2. Advantages of speech enhancement
- Frees up cognitive working space
- Allows the user to operate a computer by speaking to it
- Eliminates handwriting and spelling problems
- Always spells correctly (though it does not always recognize words correctly)
- Allows dictation of text and commands
3. Disadvantages of speech enhancement
- Assists with one stage of the writing process; it is not a solution to the whole writing problem
- Difficult to use in classroom settings because of noise interference
- Requires large amounts of memory to store voice files
- Makes errors, which can be frustrating without adequate support
- Requires each user to train the software to recognize a voice, which is hard for poor decoders
4. Applications of speech enhancement
- Speaker identification
- Automatic speech recognition
- Biomedical speech recognition
- Cell-phone speech recognition
5. Related work
(Wang and Brookes 2018) presented an algorithm for modulation-domain speech enhancement using a Kalman filter. The proposed estimator jointly models the estimated dynamics of the noise and speech spectral amplitudes to obtain a minimum mean squared error (MMSE) estimate of the speech amplitude spectrum, assuming that noise and speech are additive in the complex domain, thereby incorporating the dynamics of the noise amplitudes together with those of the speech amplitudes. To this end, the work proposed the "Gaussring" statistical model, a mixture of Gaussians whose centres lie on a circle in the complex plane. The performance of the proposed algorithm was evaluated using the short-time objective intelligibility (STOI) measure, the perceptual evaluation of speech quality (PESQ) measure, and segmental SNR (segSNR). On the speech-quality measures, the proposed algorithm was shown to provide consistent improvement over a wide range of SNRs compared with competitive algorithms, and speech recognition experiments showed that the Gaussring-based algorithm performs well on two types of noise.
(Bando, et al. 2018) implemented a semi-supervised speech enhancement technique known as variational autoencoder–nonnegative matrix factorization (VAE-NMF), which combines a probabilistic generative model of speech based on a VAE with a noise model based on nonnegative matrix factorization. Only the speech model is pre-trained, using a sufficient amount of clean speech. Using the speech model as a prior distribution, posterior estimates of the clean speech can be obtained with Markov chain Monte Carlo (MCMC) sampling while the noise model adapts to the noisy environment. Experiments confirmed that VAE-NMF outperformed conventional supervised techniques based on deep neural networks in unseen noisy environments. A stimulating next direction is to extend VAE-NMF to the multichannel scenario: since a VAE and a well-studied linear phase model can represent complicated speech signals and the spatial mixing process, respectively, it would be efficient to integrate these models into a unified probabilistic framework. GAN-based training of the speech model could also be considered in order to learn a probability distribution of speech more accurately.
(Donahue, et al. 2018) introduced the frequency-domain Speech Enhancement Generative Adversarial Network (FSEGAN), a technique based on generative adversarial networks (GANs) that performs speech enhancement in the frequency domain, and showed improvements in automatic speech recognition (ASR) performance relative to a previous time-domain method. They further provided evidence that, when the recognizer is retrained, FSEGAN can improve the performance of ASR systems previously trained with multi-style training (MTR). The experiments indicated that, for ASR, simpler regression techniques may be preferable to GAN-based enhancement. FSEGAN produces plausible spectra and could be more valuable for telephony applications when combined with an invertible feature representation.
(Pascual, et al. 2018) studied the adaptation of a speech enhancement generative adversarial network, fine-tuning the generator with the least possible amount of data. To examine the minimum requirements, stable behaviour was obtained in terms of several objective metrics for two different languages, Korean and Catalan. The main objective of the study was the variability of test performance on unseen noise as a function of the number of noise types available in the training set. Adapting the pre-trained English model with ten minutes of data was shown to achieve performance comparable to training with two orders of magnitude more data. In addition, they demonstrated relative stability of test performance with respect to the number of training noise types.
(Zhao, et al. 2018) elucidated EHNET, which combines recurrent neural networks and convolutional neural networks to improve speech. EHNET's inductive bias is well suited to speech enhancement: the convolution kernels can efficiently detect local patterns in spectrograms, and the bidirectional recurrent connections can automatically model the dynamic correlations between adjacent frames. Owing to the local nature of convolutions, EHNET requires fewer computations than a plain recurrent neural network. The results demonstrated that EHNET consistently outperforms its competitors on all five metrics, and that it generalizes to unseen noise, confirming EHNET's effectiveness for speech enhancement.
6. Challenges to be overcome
In the existing work, the classical estimates guided by the a priori and a posteriori SNR decisions become latent variables in the noise-reduction network (NRN), from which the estimated frequency-dependent probability of speech presence is used to recursively update the latent variables. However, the gradient of a recurrent neural network (RNN) is very unstable if ReLU is used as the activation function. As a result, such RNNs are unable to process very long sequences, cannot be stacked into very deep models, and cannot track long-term dependencies.
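The instability described above can be illustrated with a toy forward recurrence. This is a minimal NumPy sketch with hypothetical dimensions and weights, not the NRN from the cited work: with ReLU and a recurrent weight matrix whose gain exceeds one, the hidden-state norm (and with it the backpropagated gradient) grows geometrically over long sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrence: h_t = relu(W h_{t-1} + x_t).  With a recurrent weight
# matrix whose spectral radius exceeds 1, ReLU does not saturate, so the
# hidden state grows geometrically over long sequences.
dim, steps = 32, 200
W = rng.normal(scale=2.0 / np.sqrt(dim), size=(dim, dim))  # gain > 1 (hypothetical)
h = np.ones(dim)
norms = []
for t in range(steps):
    x_t = rng.normal(scale=0.1, size=dim)
    h = np.maximum(0.0, W @ h + x_t)   # ReLU activation
    norms.append(np.linalg.norm(h))

# The hidden-state norm explodes instead of staying bounded, which is
# why plain ReLU RNNs struggle with very long inputs and deep stacking.
print(norms[10], norms[-1])
```

A bounded activation such as tanh, or gated units (LSTM/GRU), keeps this recurrence stable, which motivates replacing the recurrent stack with the convolutional encoder-decoder proposed below.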
7. Proposed objectives
- To improve the accuracy of speech enhancement in the RCNN approach through the a priori and a posteriori SNR.
- To recover the quality of enhanced speech in speech-present regions, and to extend the additive-noise framework.
- To show the efficiency of speech enhancement using the increasing and decreasing feature dimensions of the Recurrent-Convolutional Encoder-Decoder (R-CED).
8. Proposed method
To overcome the above challenges, speech enhancement is used to find the noise-free speech, mainly by estimating the a priori and a posteriori SNR. The a priori SNR can be understood as the true instantaneous power ratio between each spectral component of the clean speech and of the noise, while the a posteriori SNR can be viewed as the instantaneous power ratio between each spectral component of the observed noisy speech and of the noise.

In this work, a Recurrent-Convolutional Encoder-Decoder (R-CED) network is used. R-CED consists of repetitions of a convolution layer, batch normalization, and a ReLU activation layer. R-CED encodes the features into a higher dimension along the encoder and achieves compression along the decoder. The number of filters is kept symmetric: at the encoder the number of filters is gradually increased, and at the decoder it is gradually decreased. The decoding stage initializes the trellis map, designs the circuit logic, and performs Lp-norm decoding; finally, maximum-likelihood estimates are obtained by traversing the trellis map, where the distortion elements are predicted. The decoding process yields the noise-free speech, and the loss function is then computed from the a priori SNR. For the loss function, the MSE is calculated and compared with a threshold value: if the MSE is greater than the threshold, the signal is passed back through the R-CED process; if it is less, the speech is considered enhanced. From this enhanced speech, the performance is analysed using the metrics SNR (signal-to-noise ratio), SDR (signal-to-distortion ratio), and MSE (mean squared error).
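The two SNR definitions and the MSE threshold test above can be sketched as follows. This is a minimal illustration with hypothetical spectra and an arbitrary threshold, not the paper's implementation:

```python
import numpy as np

def snr_estimates(clean_spec, noise_spec, noisy_spec, eps=1e-10):
    """Per-bin a priori and a posteriori SNR, following the definitions in
    the text: a priori SNR is the power ratio of clean speech to noise;
    a posteriori SNR is the power ratio of observed noisy speech to noise."""
    noise_power = np.abs(noise_spec) ** 2 + eps
    priori = (np.abs(clean_spec) ** 2) / noise_power
    posteriori = (np.abs(noisy_spec) ** 2) / noise_power
    return priori, posteriori

# Toy magnitude spectra for a single frame (hypothetical values).
clean = np.array([4.0, 2.0, 0.5])
noise = np.array([1.0, 1.0, 1.0])
noisy = clean + noise  # additive-noise assumption

priori, posteriori = snr_estimates(clean, noise, noisy)
mse = float(np.mean((noisy - clean) ** 2))  # loss against the clean target

# Threshold test from the text: if the loss exceeds the threshold, the
# frame is sent back through the R-CED network; otherwise the speech is
# treated as enhanced.  THRESHOLD is a hypothetical value.
THRESHOLD = 0.5
needs_another_pass = mse > THRESHOLD
print(priori, posteriori, needs_another_pass)
```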
Algorithm / techniques to be used
- SNR-based Recurrent-Convolutional Encoder-Decoder (SNR-RCED)
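The symmetric encoder-decoder filter schedule and the repeated convolution / batch-normalization / ReLU unit can be sketched as below. The filter counts and kernel size are hypothetical, and a single kernel stands in for each multi-filter layer:

```python
import numpy as np

# Illustrative R-CED layer schedule: filter counts grow along the encoder
# and shrink symmetrically along the decoder (values are hypothetical).
encoder_filters = [12, 16, 20, 24]
bottleneck = [32]
decoder_filters = encoder_filters[::-1]          # mirror of the encoder
schedule = encoder_filters + bottleneck + decoder_filters

def conv_bn_relu(x, kernel, eps=1e-5):
    """One repeated R-CED unit: convolution, batch norm, ReLU."""
    y = np.convolve(x, kernel, mode="same")       # convolution
    y = (y - y.mean()) / np.sqrt(y.var() + eps)   # batch normalization
    return np.maximum(0.0, y)                     # ReLU activation

rng = np.random.default_rng(1)
frame = rng.normal(size=128)                      # one spectral frame
out = frame
for n_filters in schedule:
    # A real layer applies `n_filters` kernels; one kernel stands in here.
    out = conv_bn_relu(out, rng.normal(size=9) / 3.0)

print(schedule, out.shape)
```

Note the schedule reads the same forwards and backwards, which is the symmetry the text describes, and the feature dimension is preserved frame-to-frame because the convolutions are padded.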
Performance metrics:
- PESQ (Perceptual Evaluation of Speech Quality)
- STOI (Short-Time Objective Intelligibility)
- CER (Character Error Rate)
- MSE (Mean Squared Error)
- SNR (Signal-to-Noise Ratio)
- SDR (Signal-to-Distortion Ratio)
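Of the listed metrics, SNR, SDR, and MSE can be computed directly from waveforms. A minimal sketch follows; the scaling-projection form of SDR used here is one common convention, and the test signal is hypothetical:

```python
import numpy as np

def mse(ref, est):
    """Mean squared error between reference and estimate."""
    return float(np.mean((ref - est) ** 2))

def snr_db(ref, est):
    """Signal-to-noise ratio in dB, treating (est - ref) as the noise."""
    noise = est - ref
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

def sdr_db(ref, est):
    """Signal-to-distortion ratio in dB: project the estimate onto the
    reference and treat the residual as distortion."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)   # optimal scaling
    target = alpha * ref
    distortion = est - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)  # 440 Hz tone
enhanced = clean + 0.01 * rng.normal(size=clean.shape)      # small residual noise

print(snr_db(clean, enhanced), sdr_db(clean, enhanced), mse(clean, enhanced))
```

PESQ and STOI, by contrast, are perceptual measures that are usually computed with dedicated implementations (for example, the third-party `pesq` and `pystoi` Python packages), and CER is obtained by running an ASR system on the enhanced speech.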
9. Flow of proposed work
Figure 1: Flow of the proposed work
References
[1] H. Zhao, et al., "Convolutional recurrent neural networks for speech enhancement,"
in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2018, pp. 2401-2405.
[2] H.-P. Liu, et al., "Bone-conducted speech enhancement using deep denoising
autoencoder," Speech Communication, vol. 104, pp. 106-112, 2018.
[3] Y. Zhao, et al., "Perceptually guided speech enhancement using deep neural
networks," in 2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2018, pp. 5074-5078.
[4] Q. He, et al., "Multiplicative update of auto-regressive gains for codebook-based
speech enhancement," IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP), vol. 25, pp. 457-468, 2017.
[5] R. Henni, et al., "A new efficient two-channel fast transversal adaptive filtering
algorithm for blind speech enhancement and acoustic noise reduction," Computers &
Electrical Engineering, vol. 73, pp. 349-368, 2019.
[6] Y. Xia and R. Stern, "A Priori SNR Estimation Based on a Recurrent Neural
Network for Robust Speech Enhancement," in Interspeech, 2018, pp. 3274-3278.
[7] X. Du, et al., "End-to-End Model for Speech Enhancement by Consistent
Spectrogram Masking," arXiv preprint arXiv:1901.00295, 2019.
[8] R. Bendoumia, "Two-channel forward NLMS algorithm combined with simple
variable step-sizes for speech quality enhancement," Analog Integrated Circuits and
Signal Processing, vol. 98, pp. 27-40, 2019.
[9] Y. Wang and M. Brookes, "Model-based speech enhancement in the modulation
domain," IEEE/ACM Transactions on Audio, Speech and Language Processing
(TASLP), vol. 26, pp. 580-594, 2018.
[10] Y. Bando, et al., "Statistical speech enhancement based on probabilistic integration
of variational autoencoder and non-negative matrix factorization," in 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2018, pp. 716-720.
[11] C. Donahue, et al., "Exploring speech enhancement with generative adversarial
networks for robust speech recognition," in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5024-5028.
[12] S. Pascual, et al., "Language and noise transfer in speech enhancement generative
adversarial network," in 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2018, pp. 5019-5023.
[13] W. Xue, et al., "Modulation-Domain Parametric Multichannel Kalman Filtering for
Speech Enhancement," in 2018 26th European Signal Processing Conference
(EUSIPCO), 2018, pp. 2509-2513.
[14] X. Leng, et al., "On Speech Enhancement Using Microphone Arrays in the Presence
of Co-Directional Interference," in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 511-515.
[15] Y. Bando, et al., "Speech enhancement based on Bayesian low-rank and sparse
decomposition of multichannel magnitude spectrograms," IEEE/ACM Transactions
on Audio, Speech, and Language Processing, vol. 26, pp. 215-230, 2017.
[16] S. China Venkateswarlu and A. Karthik, "Performance on Speech Enhancement Objective Quality Measures Using Hybrid Wavelet Thresholding," International Journal of Engineering and Advanced Technology, Blue Eyes Intelligence Engineering & Sciences Publication, vol. 8, issue 6, pp. 3523-3533, 2019.