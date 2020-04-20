
Performance estimation based recurrent-convolutional encoder
decoder for speech enhancement

International Journal of Advanced Science and Technology, Vol. 29, No. 05 (2020), pp. 772-777. ISSN: 2005-4238 IJAST. Copyright ⓒ 2020 SERSC.

A. Karthik (1), Dr. J. L. Mazher Iqbal (2)
(1) Research Scholar, Veltech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai
(2) Research Professor, Veltech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai

Abstract
Speech is the key to our communication skills. As we use recorded speech to communicate remotely with other human beings, we become more and more accustomed to machines that simply "listen to us". The goal of speech enhancement is to improve the intelligibility and/or the overall perceptual quality of a degraded speech signal through audio signal processing techniques. Noise reduction is the most important area of speech enhancement and is used in many applications such as cell phones, VoIP, teleconferencing systems, speech recognition and hearing aids. Speech enhancement is necessary in many applications where a clean speech signal is important for further processing.

Keywords: Speech Recognition, Automatic Speech Recognition (ASR), Recurrent Convolutional Encoder-Decoder (R-CED) network, PESQ, STOI and CER.

1. Introduction
Speech enhancement techniques focus primarily on removing noise from a speech signal. There are various types of noise and various techniques to eliminate them. In recent years, learning architectures based on deep neural networks (DNN) have been very successful in related areas such as speech recognition. The success of DNNs in automatic speech recognition (ASR) has led to the study of deep neural networks for noise suppression for ASR and for speech enhancement.
The central idea behind using DNNs for speech enhancement is that the corruption of speech by noise is a complex process, and a complex nonlinear model such as a DNN is well suited to modelling it. Although there is very little in-depth work on the usefulness of DNNs for speech enhancement, it has shown promising results and could outperform classic SE methods. A common aspect of many of these works is evaluation under matched (seen) noise conditions. Matched or seen conditions imply that the types of test noise (e.g. ground noise) are the same as those used for training. Unlike classical methods, which are motivated by aspects of signal processing, DNN-based methods are data-driven approaches, and matched noise conditions may not be ideal for evaluating DNNs for speech enhancement.

Speech enhancement (SE) is a serious research problem in audio signal processing. The goal is to improve the quality and intelligibility of speech signals corrupted by noise. It has applications in various sectors, such as automatic speech recognition, mobile communication and hearing aids.

2. Advantages on speech enhancement:
Frees up cognitive working space
Allows the user to operate a computer by speaking to it
Eliminates handwriting and spelling problems
Always spells correctly (but does not always recognize words correctly)
Allows dictation of text and commands

3. Disadvantages on speech enhancement
Assists with one stage of the writing process; not a solution to the writing problem
Difficult to use in classroom settings, due to noise interference
Requires large amounts of memory to store voice files
Makes errors; can be frustrating without adequate support
Requires each user to train the software to recognize a voice; hard for poor decoders

4. Application of speech enhancement
Speaker identification
Automatic speech recognition
Biomedical speech recognition
Cell phone speech recognition

5. Related work:
Wang and Brookes (2018) presented a speech enhancement algorithm in the modulation domain using a Kalman filter. The proposed estimator jointly models the estimated dynamics of the noise and speech spectral amplitudes to obtain a minimum mean squared error (MMSE) estimate of the speech amplitude spectrum, assuming that noise and speech are additive in the complex domain. To include the dynamics of the noise amplitudes together with those of the speech amplitudes, the work proposed the statistical "Gaussring" model, a mixture of Gaussians whose centres lie on a circle in the complex plane.
The performance of the proposed algorithm was evaluated using the STOI (short-time objective intelligibility), PESQ (perceptual evaluation of speech quality) and segSNR (segmental SNR) measures. For the speech quality measures, the proposed algorithm was shown to give a consistent improvement over a wide range of SNRs when compared with competitive algorithms. Speech recognition experiments also showed that the Gaussring-based algorithm performs well for two types of noise.

Bando et al. (2018) implemented a semi-supervised speech enhancement technique known as variational autoencoder-non-negative matrix factorization (VAE-NMF), which combines a probabilistic generative model of speech based on a VAE with a noise model based on non-negative matrix factorization. Only the speech model is pre-trained, using a sufficient amount of clean speech. With the speech model as a prior distribution, posterior estimates of the clean speech are obtained by Markov chain Monte Carlo (MCMC) sampling while the noise model is adapted to noisy environments. Experiments confirmed that VAE-NMF outperformed conventional supervised techniques based on deep neural networks in unseen noisy environments. A promising next direction is to extend VAE-NMF to the multichannel scenario. Since a VAE and a well-studied linear phase model can represent complicated speech signals and a spatial mixing process, respectively, it would be effective to integrate these models into a unified probabilistic framework. GAN-based training of the speech model was also suggested as a way to accurately learn the probability distribution of speech.

Donahue et al. (2018) introduced the frequency-domain speech enhancement generative adversarial network (FSEGAN), a technique based on generative adversarial networks (GAN) that performs speech enhancement in the frequency domain, and showed improvements in automatic speech recognition (ASR) performance relative to the previous time-domain method. They then provided evidence that, with retraining, FSEGAN could improve the performance of existing ASR systems trained with multi-style training (MTR). The experiments indicated that, for ASR, simpler regression techniques may be preferable to GAN-based enhancement. FSEGAN appears to produce plausible spectra and could be more valuable for telephony applications when combined with an invertible feature representation.

Pascual et al. (2018) investigated adapting a speech enhancement generative adversarial network to new languages and noise types by fine-tuning the generator with a minimal amount of data. To examine the minimum requirements, they reported stable behaviour in terms of various objective metrics for two different languages: Korean and Catalan.
The study also examined how test performance on unseen noise varies with the number of different noise types available in the training set. Adapting the pre-trained English model with only ten minutes of data already achieved performance comparable to having two orders of magnitude more data. In addition, the authors demonstrated relative stability of test performance with respect to the number of training noise types.

Zhao et al. (2018) proposed EHNET, which combines convolutional neural networks and recurrent neural networks for speech enhancement. EHNET's inductive bias is well suited to speech enhancement: the convolutional kernels can efficiently detect local patterns in spectrograms, and the bidirectional recurrent connections can automatically model dynamic correlations between adjacent frames. Due to the sparse nature of convolutions, EHNET requires fewer computations than multilayer perceptrons (MLP) and recurrent neural networks. The experimental results demonstrated that EHNET consistently outperforms competitors in all five metrics. In addition, it was able to generalize to unseen noise, which confirmed EHNET's effectiveness for speech enhancement.

6. Challenges to be overcome:
In the existing work, the a priori and a posteriori SNRs of the classical decision-directed techniques become latent variables in the RNN, from which the estimated frequency-dependent probability of speech presence is used to recursively update the latent variables. However, recurrent neural networks (RNN) are very unstable if ReLU is used as the activation function. Because of the activation function, RNNs are unable to process very long sequences, cannot be stacked into very deep models, and cannot track long-term dependencies.
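The instability mentioned in the challenges above can be illustrated with a toy example. The sketch below is not the network from any of the cited works; it iterates a hypothetical scalar recurrence h_t = f(w * h_{t-1}) and compares an unbounded activation (ReLU) with a saturating one (tanh). The weights 1.1 and 0.9, the step count and all function names are assumptions chosen only for illustration.

```python
import math


def relu(x):
    """Rectified linear unit: unbounded for positive inputs."""
    return max(0.0, x)


def run_recurrence(weight, steps, activation, h0=1.0):
    """Iterate h_t = activation(weight * h_{t-1}) and return the final state."""
    h = h0
    for _ in range(steps):
        h = activation(weight * h)
    return h


# Recurrent weight slightly above 1: with ReLU the state grows geometrically.
relu_growing = run_recurrence(1.1, 100, relu)

# Recurrent weight slightly below 1: with ReLU the state decays toward zero,
# so information from early time steps is effectively lost.
relu_decaying = run_recurrence(0.9, 100, relu)

# Same weight above 1 with tanh: the state saturates and stays bounded.
tanh_bounded = run_recurrence(1.1, 100, math.tanh)

print(relu_growing, relu_decaying, tanh_bounded)
```

In this scalar setting the ReLU state either blows up or vanishes over 100 steps depending on the weight, while the tanh state stays bounded, which is one simple way to see why long sequences are difficult for a ReLU-based recurrence.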
7. Proposed objectives
To improve the accuracy of speech enhancement in the RCNN approach using the a priori and a posteriori SNR.
To recover the quality of enhanced speech in the speech-present regions, and extend the additive noise framework.
To show the efficiency of speech enhancement with the increasing and decreasing dimensions used by the Recurrent Convolutional Encoder-Decoder (R-CED).

8. Proposed method
To overcome the above challenges, speech enhancement is used to obtain noise-free speech, mainly by estimating the a priori and a posteriori SNR. The a priori SNR can be understood as the true instantaneous power ratio between each spectral component of clean speech and noise, while the a posteriori SNR can be viewed as the instantaneous power ratio between each spectral component of observed noisy speech and noise.

In this proposed work, a Recurrent Convolutional Encoder-Decoder (R-CED) network is used. R-CED consists of repetitions of a convolution, batch normalization, and a ReLU activation layer. R-CED encodes the features into a higher dimension along the encoder and achieves compression along the decoder. The number of filters is kept symmetric: at the encoder, the number of filters is gradually increased, and at the decoder, the number of filters is gradually decreased.

The procedure initializes the trellis map, designs the circuit logic and performs LP-norm decoding. Finally, maximum likelihood estimates are obtained by traversing the trellis map, where the distortion elements are predicted. The decoding process yields the noise-free speech, and the loss function is then computed from the a priori SNR. At the loss function, the MSE is calculated and compared with a threshold value. If the MSE is greater than the threshold, the signal goes back to the R-CED process.
If the MSE is lower than the threshold, the speech is considered enhanced. The performance of this enhanced speech is then analyzed using the metrics SNR (Signal to Noise Ratio), SDR (Signal to Distortion Ratio) and MSE (Mean Squared Error).

Algorithm / Techniques to be used
SNR based Recurrent-Convolutional Encoder Decoder (SNR-RCED)

Performance metrics:
PESQ (Perceptual Evaluation of Speech Quality)
STOI (Short-time objective intelligibility)
CER (Character Error Rate)
MSE (Mean Squared Error)
SNR (Signal to Noise Ratio)
SDR (Signal to Distortion Ratio)
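A few of the quantities named in the proposed method can be written down in a short sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the per-bin power values, the threshold of 0.05, and every function name are hypothetical, and the threshold check assumes the intended reading is that an MSE above the threshold triggers another R-CED pass. CER is computed here as character-level edit distance divided by the reference length, which is the usual definition.

```python
import math


def a_priori_snr(clean_power, noise_power):
    """Per-bin power ratio of clean speech to noise."""
    return [s / n for s, n in zip(clean_power, noise_power)]


def a_posteriori_snr(noisy_power, noise_power):
    """Per-bin power ratio of observed noisy speech to noise."""
    return [y / n for y, n in zip(noisy_power, noise_power)]


def mse(reference, estimate):
    """Mean squared error between a reference and an estimated signal."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def snr_db(reference, estimate):
    """Output SNR in dB: reference energy over residual error energy."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / error)


def needs_another_pass(reference, estimate, threshold):
    """Assumed decision rule: repeat the R-CED pass while MSE exceeds the threshold."""
    return mse(reference, estimate) > threshold


def character_error_rate(reference, hypothesis):
    """Levenshtein distance between two strings divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical two-bin spectra and a four-sample signal, for illustration only.
xi = a_priori_snr([4.0, 1.0], [1.0, 2.0])
gamma = a_posteriori_snr([5.0, 3.0], [1.0, 2.0])
clean = [1.0, -1.0, 1.0, -1.0]
enhanced = [0.9, -1.1, 1.1, -0.9]
repeat = needs_another_pass(clean, enhanced, threshold=0.05)
cer = character_error_rate("speech", "speach")
print(xi, gamma, mse(clean, enhanced), snr_db(clean, enhanced), repeat, cer)
```

PESQ, STOI and SDR are left out of the sketch because they depend on standardized or toolkit-specific procedures that go beyond a few lines.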
9. Flow of proposed work
Figure 1: Flow of the proposed work

References
[1] H. Zhao, et al., "Convolutional recurrent neural networks for speech enhancement," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2401-2405.
[2] H.-P. Liu, et al., "Bone-conducted speech enhancement using deep denoising autoencoder," Speech Communication, vol. 104, pp. 106-112, 2018.
[3] Y. Zhao, et al., "Perceptually guided speech enhancement using deep neural networks," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5074-5078.
[4] Q. He, et al., "Multiplicative update of auto-regressive gains for codebook-based speech enhancement," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, pp. 457-468, 2017.
[5] R. Henni, et al., "A new efficient two-channel fast transversal adaptive filtering algorithm for blind speech enhancement and acoustic noise reduction," Computers & Electrical Engineering, vol. 73, pp. 349-368, 2019.
[6] Y. Xia and R. Stern, "A Priori SNR Estimation Based on a Recurrent Neural Network for Robust Speech Enhancement," in Interspeech, 2018, pp. 3274-3278.
[7] X. Du, et al., "End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking," arXiv preprint arXiv:1901.00295, 2019.
[8] R. Bendoumia, "Two-channel forward NLMS algorithm combined with simple variable step-sizes for speech quality enhancement," Analog Integrated Circuits and Signal Processing, vol. 98, pp. 27-40, 2019.
[9] Y. Wang and M. Brookes, "Model-based speech enhancement in the modulation domain," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, pp. 580-594, 2018.
[10] Y. Bando, et al., "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 716-720.
[11] C. Donahue, et al., "Exploring speech enhancement with generative adversarial networks for robust speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5024-5028.
[12] S. Pascual, et al., "Language and noise transfer in speech enhancement generative adversarial network," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5019-5023.
[13] W. Xue, et al., "Modulation-Domain Parametric Multichannel Kalman Filtering for Speech Enhancement," in 2018 26th European Signal Processing Conference (EUSIPCO), 2018, pp. 2509-2513.
[14] X. Leng, et al., "On Speech Enhancement Using Microphone Arrays in the Presence of Co-Directional Interference," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 511-515.
[15] Y. Bando, et al., "Speech enhancement based on Bayesian low-rank and sparse decomposition of multichannel magnitude spectrograms," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 215-230, 2017.
[16] S. China Venkateswarlu and A. Karthik, "Performance on Speech Enhancement Objective Quality Measures Using Hybrid Wavelet Thresholding," International Journal of Engineering and Advanced Technology, Blue Eyes Intelligence Engineering & Sciences Publication, vol. 8, issue 6, pp. 3523-3533, 2019.

