Previous research has identified the autocorrelation domain as an appropriate domain for signal and noise
separation. This paper discusses a simple and effective method for decreasing the effect of noise on the
autocorrelation of the clean signal, which can later be used in extracting mel-cepstral parameters for
speech recognition. Two different methods are proposed to deal with the error introduced by treating
speech and noise as completely uncorrelated. The basic approach reduces the effect of noise by estimating
its autocorrelation and subtracting it from the autocorrelation of the noisy speech signal. To improve
this method, we insert a speech/noise cross-correlation term into the equations used to estimate the
clean-speech autocorrelation, using an estimate of this term found through a kernel method.
Alternatively, we estimate the cross-correlation term using an averaging approach. A further
improvement was obtained by introducing an overestimation parameter into the basic method. We
tested the proposed methods on the Aurora 2 task. The basic method shows considerable improvement
over the standard features and several other robust autocorrelation-based features, and the proposed
techniques further increase the robustness of the basic autocorrelation-based method.
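The basic method described above can be sketched in a few lines of numpy: estimate the noise autocorrelation from a noise-only reference segment and subtract it (scaled by an overestimation factor) from the noisy-speech autocorrelation. The single-frame setup, the sinusoidal "speech", and the parameter names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def autocorr(x, max_lag):
    """Biased autocorrelation estimate for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([x[:n - k] @ x[k:] for k in range(max_lag + 1)]) / n

def clean_autocorr_estimate(noisy, noise_ref, max_lag, alpha=1.0):
    """Basic method: subtract the noise autocorrelation (scaled by an
    overestimation factor alpha) from the noisy-speech autocorrelation,
    assuming speech and noise are uncorrelated."""
    return autocorr(noisy, max_lag) - alpha * autocorr(noise_ref, max_lag)

# Demo: a sinusoid stands in for speech; white noise is added, and a
# separate noise-only segment serves as the noise reference.
rng = np.random.default_rng(0)
n = 16000
clean = np.sin(2 * np.pi * 200 * np.arange(n) / 8000)
noisy = clean + rng.normal(0, 0.7, n)
noise_ref = rng.normal(0, 0.7, n)
r_clean = autocorr(clean, 12)
r_noisy = autocorr(noisy, 12)
r_est = clean_autocorr_estimate(noisy, noise_ref, 12)
```

Because white noise contributes mainly to lag 0, the subtraction mostly repairs the zero-lag term, which is exactly where additive noise inflates the autocorrelation.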
A REVIEW OF LPC METHODS FOR ENHANCEMENT OF SPEECH SIGNALS - ijiert bestjournal
This paper presents a review of LPC methods for enhancement of speech signals. The purpose of all speech enhancement methods is to improve the quality of the speech signal by minimizing background noise. The paper comments in particular on Power Spectral Subtraction, Multiband Spectral Subtraction, the Non-Linear Spectral Subtraction Method, the MMSE Spectral Subtraction Method, and Spectral Subtraction based on perceptual properties. Since spectral subtraction produces residual noise and musical noise, Linear Predictive Analysis is used to enhance the speech.
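As a concrete illustration of the power-spectral-subtraction idea this review surveys, here is a minimal single-frame numpy sketch. The over-subtraction factor `alpha` and the spectral floor `beta` (used to limit musical noise) are common textbook choices, not parameters taken from this paper.

```python
import numpy as np

def spectral_subtract(noisy, noise_est, alpha=1.0, beta=0.01):
    """Basic power spectral subtraction on one frame: subtract an estimated
    noise power spectrum from the noisy power spectrum, floor the result,
    and resynthesize using the noisy phase (standard practice)."""
    N = np.fft.rfft(noisy)
    P_noisy = np.abs(N) ** 2
    P_noise = np.abs(np.fft.rfft(noise_est)) ** 2
    P_clean = np.maximum(P_noisy - alpha * P_noise, beta * P_noisy)
    return np.fft.irfft(np.sqrt(P_clean) * np.exp(1j * np.angle(N)), n=len(noisy))

# Demo: a noisy sine frame with an (idealized) exact noise reference.
rng = np.random.default_rng(1)
t = np.arange(512) / 8000
clean = np.sin(2 * np.pi * 440 * t)
noise = rng.normal(0, 0.5, 512)
enhanced = spectral_subtract(clean + noise, noise)
```

The floor `beta * P_noisy` is what distinguishes this from naive subtraction: clamping negative power estimates is precisely where the residual "musical noise" mentioned above originates.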
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition...TELKOMNIKA JOURNAL
Speech recognition can be defined as the process of converting voice signals into a sequence of words
by applying a specific algorithm implemented in a computer program. Research on speech recognition in
Indonesia is relatively limited. This paper studies which feature extraction method is better, Linear
Predictive Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), for speech recognition in the
Indonesian language. This is important because a method that produces high accuracy for a particular
language does not necessarily produce the same accuracy for other languages, considering that every
language has different characteristics. This research therefore hopefully helps accelerate the use of
automatic speech recognition for the Indonesian language. There are two main processes in speech
recognition: feature extraction and recognition. The feature extraction methods compared in this study
are LPC and MFCC, while recognition uses a Hidden Markov Model (HMM). The test results show that the
MFCC method is better than LPC for Indonesian-language speech recognition.
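Since the comparison above hinges on the LPC front end, a minimal numpy sketch of LPC analysis via the autocorrelation method (the Levinson-Durbin recursion) may help. The AR(2) demo signal and its coefficients are illustrative assumptions, not data from the paper.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin).
    Returns (a, err) with a[0] = 1, so A(z) = sum_k a[k] z^-k whitens the frame."""
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k      # remaining prediction-error power
    return a, err

# Demo: recover the coefficients of a known AR(2) process
# x[t] = 0.75*x[t-1] - 0.5*x[t-2] + e[t], i.e. a = [1, -0.75, 0.5].
rng = np.random.default_rng(0)
e = rng.normal(0, 1, 20000)
x = np.zeros(20000)
for t in range(2, 20000):
    x[t] = 0.75 * x[t - 1] - 0.5 * x[t - 2] + e[t]
a_est, _ = lpc(x, 2)
```

An MFCC front end, by contrast, replaces this all-pole model with a mel filterbank and a DCT of log energies, which is one reason the two feature sets behave differently per language.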
A New Method for Pitch Tracking and Voicing Decision Based on Spectral Multi-...CSCJournals
This paper proposes a new voicing detection and pitch estimation method that is particularly robust for noisy speech. The method is based on spectral analysis of the speech multi-scale product. The multi-scale product (MP) consists of taking the product of wavelet transform coefficients; the wavelet used is the quadratic spline function. We argue that the spectrum of the multi-scale product reveals an estimate of the pitch harmonic more accurately, even in heavily noisy scenarios. We evaluate our approach on the Keele database. The experimental results show the robustness of our method for noisy speech and good performance for clean speech in comparison with state-of-the-art algorithms.
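To make the multi-scale product idea concrete, here is a toy numpy sketch: smoothed first differences at several dyadic scales stand in for the quadratic-spline wavelet coefficients of the paper, and the pitch is read off as the strongest spectral peak of their product in a plausible pitch band. This is a crude illustrative stand-in, not the authors' algorithm.

```python
import numpy as np

def multiscale_product(x, scales=(1, 2, 4)):
    """Toy multi-scale product: multiply smoothed first derivatives of the
    signal at several dyadic scales (a stand-in for the quadratic-spline
    wavelet coefficients used in the paper)."""
    mp = np.ones(len(x))
    for s in scales:
        smooth = np.convolve(x, np.ones(s) / s, mode="same")
        mp *= np.gradient(smooth)
    return mp

def pitch_from_mp(x, fs, fmin=50.0, fmax=500.0):
    """Pick the strongest spectral peak of the multi-scale product
    inside a plausible pitch band."""
    mp = multiscale_product(x) * np.hanning(len(x))
    spec = np.abs(np.fft.rfft(mp))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(spec[band])]

# Demo: a 200 Hz tone should yield a pitch estimate near 200 Hz.
fs = 8000
t = np.arange(4096) / fs
f0_est = pitch_from_mp(np.sin(2 * np.pi * 200 * t), fs)
```

The product sharpens periodic structure shared across scales while incoherent noise tends to cancel, which is the intuition behind the method's noise robustness.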
Realization and design of a pilot assist decision making system based on spee...csandit
A system based on speech recognition is proposed for pilot assist decision-making. It is based on a HIL aircraft simulation platform and uses the microcontroller SPCE061A as the central processor to achieve better reliability and higher cost-effectiveness. Technologies of LPCC (linear predictive cepstral coding) and DTW (Dynamic Time Warping) are applied for isolated-word speech recognition to obtain a smaller amount of calculation and better real-time performance. Besides, we adopt PWM (Pulse Width Modulation) regulation technology to effectively regulate each control surface by speech, and thus to assist the pilot in making decisions. Trials show a satisfactory speech recognition accuracy rate and control effect. More importantly, our paper provides a creative idea for intelligent human-computer interaction and applications of speech recognition in the field of aviation control. Our system is also easy to extend and apply.
International Journal of Engineering Research and Applications (IJERA) aims to cover the latest outstanding developments in all fields of engineering technology and science.
IJERA is an association of scientists and academics, not a publication service or private publisher running journals for monetary benefit; it focuses on supporting authors who want to publish their work. Articles published in the journal can be accessed online, and all articles are archived for real-time access.
The journal primarily aims to bring out the research talent and work of scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. It covers scientific research in a broad sense rather than a niche area, enabling researchers from various verticals to publish their papers, and aims to give researchers a platform to publish in a shorter time so they can continue their further work. All published articles are freely available to researchers in government agencies, to educators, and to the general public. Serious efforts are being made to promote the journal across the globe so that it serves as a scientific platform for all researchers to publish their work online.
This paper reports on an Audio-Visual Client Recognition System, implemented in Matlab, which identifies five clients and can be extended to identify as many clients as it is trained for. The visual recognition part was implemented using Principal Component Analysis, Linear Discriminant Analysis, and a Nearest Neighbour Classifier. The audio recognition part was implemented using Mel-Frequency Cepstrum Coefficients, Linear Discriminant Analysis, and a Nearest Neighbour Classifier. The system was tested with images and sounds it had not been trained on to see whether it could detect an intruder, and it responded to intruders with precision.
High Quality Arabic Concatenative Speech Synthesis - sipij
This paper describes the implementation of TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) tools to improve the quality of an Arabic Text-to-Speech (TTS) system based on diphone concatenation with a TD-PSOLA modifier synthesizer. It presents techniques to improve the precision of prosodic modifications in Arabic speech synthesis using TD-PSOLA, an approach based on decomposing the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modification of the speech signal, together with adjustment and optimisation of a diphone database with integrated vowels.
Comparison and Analysis Of LDM and LMS for an Application of a Speech - CSCJournals
Most automatic speech recognition (ASR) systems are based on Gaussian mixture models, whose outputs depend on subphone states. We often measure and transform the speech signal into another form to enhance our ability to communicate. Speech recognition is the conversion of an acoustic waveform into its written equivalent message information. The nature of the speech recognition problem depends heavily on the constraints placed on the speaker, the speaking situation, and the message context. Various speech recognition systems are available; the best model is one that detects the hidden conditions of speech. LMS is one of the simplest algorithms used to reconstruct speech, and the linear dynamic model (LDM) is also used to recognize speech in noisy environments. This paper analyzes and compares the LDM and a simple LMS algorithm for speech recognition purposes.
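Since the abstract leans on LMS as the "simple algorithm" in the comparison, a minimal numpy sketch of LMS adaptation may help. The system-identification demo (a known 2-tap FIR target) is an illustrative assumption, not the paper's experiment.

```python
import numpy as np

def lms(d, x, num_taps=4, mu=0.05):
    """Least-mean-squares adaptive filter: adapt weights w so the filtered
    reference x tracks the desired signal d; returns (w, e)."""
    w = np.zeros(num_taps)
    e = np.zeros(len(d))
    for n in range(num_taps, len(d)):
        x_vec = x[n - num_taps + 1:n + 1][::-1]   # most recent sample first
        y = w @ x_vec                              # filter output
        e[n] = d[n] - y                            # instantaneous error
        w += mu * e[n] * x_vec                     # stochastic-gradient step
    return w, e

# Demo: identify a known 2-tap FIR system, d[n] = 0.5*x[n] - 0.3*x[n-1],
# from white-noise input.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 5000)
d = 0.5 * x - 0.3 * np.concatenate(([0.0], x[:-1]))
w_est, e = lms(d, x)
```

In the noiseless case the error decays essentially to zero; with additive noise, the same loop becomes the simple enhancement/reconstruction tool the abstract refers to.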
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...sipij
Hearing-impaired people usually use hearing aids that implement speech enhancement algorithms.
Estimation of speech and estimation of noise are the components of a single-channel speech enhancement
system. The main objective of any speech enhancement algorithm is estimation of the noise power
spectrum in non-stationary environments. A VAD (Voice Activity Detector) is used to identify speech
pauses, and noise is estimated only during these pauses. The MMSE (Minimum Mean Square Error) speech
enhancement algorithm does not enhance intelligibility and quality or reduce listener fatigue, which
are the perceptual aspects of speech. A novel evaluation approach, SR (Signal to Residual spectrum
ratio), based on an uncertainty parameter, is introduced to control distortions for the benefit of
hearing-impaired people in non-stationary environments. Noise is estimated and updated by dividing the
original signal into pure-speech, quasi-speech, and non-speech frames based on multiple threshold
conditions. Different values of SR and LLR demonstrate the amount of attenuation and amplification
distortion. The proposed method is compared with the WAT (Weighted Average Technique) method using the
parameters SR (signal-to-residual spectrum ratio), LLR (log-likelihood ratio), and MMSE, in terms of
segmental SNR and LLR.
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First-generation techniques include formant synthesis and articulatory synthesis. Formant synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. Articulatory synthesis produces speech by directly modeling human articulator behavior. Second-generation techniques incorporate concatenative synthesis and sinusoidal synthesis. Concatenative synthesis generates speech output by concatenating segments of recorded speech, and generally produces natural-sounding synthesized speech. Sinusoidal synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency; the model parameters are the amplitudes and periods of the harmonics, so the value of the fundamental can be changed while keeping the same basic spectral envelope. Third-generation techniques include Hidden Markov Model (HMM) synthesis and unit selection synthesis. HMM synthesis trains a parameter module and produces high-quality speech. Finally, unit selection operates by selecting the best sequence of units from a large speech database that matches the specification.
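The sinusoidal (harmonic-model) synthesis described above can be sketched in a few lines of numpy: a frame is a sum of harmonics of the fundamental with given amplitudes. Fixing phases at zero and the particular f0/amplitude values are simplifying assumptions for illustration.

```python
import numpy as np

def sinusoidal_frame(f0, amps, n_samples, fs=16000):
    """Sinusoidal (harmonic-model) synthesis of one frame: sum of harmonics
    k*f0 with the given amplitudes (phases fixed at zero for simplicity)."""
    t = np.arange(n_samples) / fs
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(amps))

# Demo: a 120 Hz "voiced" frame with a decaying harmonic-amplitude envelope.
# Changing f0 while keeping amps fixed alters pitch but preserves the
# basic spectral envelope, as described above.
frame = sinusoidal_frame(120.0, [1.0, 0.6, 0.3, 0.1], 1024)
```

This decoupling of fundamental frequency from the amplitude envelope is exactly what makes prosody modification straightforward in second-generation sinusoidal synthesizers.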
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...IJERA Editor
This paper presents a compressive sensing technique for speech reconstruction using linear predictive coding, because speech is sparser in the LPC domain. The DCT of the speech is taken, and DCT points of the sparse speech are discarded arbitrarily; this is achieved by setting some points in the DCT domain to zero by multiplying with mask functions. From the incomplete points in the DCT domain, the original speech is reconstructed using compressive sensing, with Gradient Projection for Sparse Reconstruction as the tool. The performance of the result is compared subjectively with the direct IDCT. The experiments show that performance is better with compressive sensing than with the direct IDCT.
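The measurement step and the "direct IDCT" baseline described above can be sketched with scipy's DCT routines. This shows only the masking and the baseline inversion; the GPSR sparse-reconstruction solver itself is not implemented here, and the signal and keep-fraction are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def mask_dct(x, keep_fraction, rng):
    """Drop a random subset of DCT points by multiplying with a 0/1 mask,
    as in the paper's measurement step. Returns (masked coeffs, mask)."""
    c = dct(x, norm="ortho")
    mask = rng.random(len(x)) < keep_fraction
    return c * mask, mask

# Demo: the "direct IDCT" baseline simply inverts the incomplete
# coefficients; a sparse solver such as GPSR would instead search for the
# sparsest coefficient vector consistent with the kept measurements.
rng = np.random.default_rng(0)
t = np.arange(256) / 8000
x = np.sin(2 * np.pi * 300 * t)
c_masked, mask = mask_dct(x, 0.7, rng)
baseline = idct(c_masked, norm="ortho")
```

Because the orthonormal DCT preserves energy, the baseline's reconstruction error equals exactly the energy of the discarded coefficients, which is the gap a sparse-reconstruction method tries to close.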
General Kalman Filter & Speech Enhancement for Speaker Identification - ijcisjournal
The presence of noise increases the dimension of the information. A noise suppression algorithm is
developed by combining the General Kalman Filter with the Expectation-Maximization (EM) framework. This
combination is helpful and effective in identifying the noise characteristics of an acoustic environment.
Recursion between the estimation step and the maximization step enables the algorithm to deal with any
noise model. The same speech enhancement procedure is applied in the pre-processing stage of a
conventional speaker identification method. Due to the non-stationary nature of noise and speech,
adaptive algorithms are required. The algorithm is first applied to the speech enhancement problem and
then extended for use in the pre-processing step of speaker identification. The present work is
compared, in terms of significant metrics, with existing and popular algorithms, and results show that
the developed algorithm dominates them.
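To ground the Kalman-filtering side of the combination above, here is a minimal scalar Kalman filter for an AR(1) state observed in white noise. This is a textbook sketch, not the paper's General-Kalman-plus-EM algorithm (in the paper's setting, EM would re-estimate the model parameters `a`, `q`, `r` between filtering passes); the AR(1) demo is an illustrative assumption.

```python
import numpy as np

def kalman_ar1(y, a, q, r):
    """Scalar Kalman filter for an AR(1) state x[t] = a*x[t-1] + w (var q),
    observed as y[t] = x[t] + v (var r); returns the filtered estimates."""
    x_est = 0.0
    p = q / (1 - a * a)              # stationary prior variance
    out = np.zeros(len(y))
    for t, obs in enumerate(y):
        # Predict
        x_pred = a * x_est
        p_pred = a * a * p + q
        # Update
        k = p_pred / (p_pred + r)    # Kalman gain
        x_est = x_pred + k * (obs - x_pred)
        p = (1 - k) * p_pred
        out[t] = x_est
    return out

# Demo: denoise an AR(1) signal observed in white noise.
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.95 * x[t - 1] + rng.normal(0, 0.1)
y = x + rng.normal(0, 0.5, n)
x_hat = kalman_ar1(y, 0.95, 0.01, 0.25)
```

Speech Kalman enhancers generalize this scalar recursion to a higher-order AR state for the speech signal, with EM supplying the unknown noise statistics.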
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS - ijnlc
The fundamental techniques used for man-machine communication include speech synthesis, speech
recognition, and speech transformation. Feature extraction techniques provide a compressed
representation of speech signals. HNM analysis and synthesis provides high-quality speech with fewer
parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional
sequences; it locates an optimal match between them. The improvement in alignment is estimated from the
corresponding distances. The objective of this research is to investigate the effect of dynamic time
warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five
phrases were recorded. The recorded material was segmented manually and aligned at sentence, word, and
phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation
has shown better alignment in the HNM parametric domain, and effective speech alignment can be carried
out even at phrase level.
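Dynamic time warping, as used above, can be sketched with the standard dynamic-programming recurrence over a local-cost matrix. Euclidean local cost and the sine demo sequences are illustrative assumptions; the paper's sequences are HNM feature frames.

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic time warping: total cost of the optimal alignment between two
    feature-vector sequences a (n x d) and b (m x d), Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Each cell extends the cheapest of the three allowed moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Demo: a time-stretched copy of a sequence aligns far more cheaply than an
# unrelated sequence of the same length.
seq = np.column_stack([np.sin(2 * np.pi * np.linspace(0, 1, 50))])
stretched = np.column_stack([np.sin(2 * np.pi * np.linspace(0, 1, 80))])
unrelated = np.column_stack([np.cos(6 * np.pi * np.linspace(0, 1, 80)) + 1.0])
```

The accumulated cost `D[n, m]` is the aligned distance the abstract compares against the unaligned one to quantify alignment improvement.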
05 comparative study of voice print based acoustic features mfcc and lpcc - IJAEMSJORNAL
Voice is the best biometric feature for investigation and authentication, having both biological and behavioural components; the acoustic features are related to the voice. The Speaker Recognition System is designed for automatic authentication of a speaker's identity based on the human voice. Mel Frequency Cepstrum Coefficients (MFCC) and Linear Prediction Cepstrum Coefficients (LPCC) are used for feature extraction from the provided voice sample. This paper provides a comparative study of MFCC and LPCC based on the accuracy of results and their working methodology. The results are better when MFCC is used for feature extraction.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES - kevig
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy with some other task, such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is widely used
for aligning two given multidimensional sequences; it finds an optimal match between them. The distance
between aligned sequences should be smaller than between unaligned sequences, so the improvement in
alignment may be estimated from the corresponding distances. This technique has applications in speech
recognition, speech synthesis, and speaker transformation. The objective of this research is to
investigate the amount of improvement in alignment for sentence-based and phoneme-based manually aligned
phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers
(3 males and 3 females). The recorded material was segmented manually and aligned at sentence and
phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM
parameters were further aligned at frame level using DTW. Mahalanobis distances were computed for each
pair of sentences. The investigations have shown more than 20% reduction in the average Mahalanobis
distance.
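The Mahalanobis distance used above to score aligned frame pairs is a covariance-weighted Euclidean distance; a minimal numpy sketch (with illustrative vectors and covariances, not the paper's HNM data):

```python
import numpy as np

def mahalanobis(u, v, cov):
    """Mahalanobis distance between vectors u and v under covariance cov:
    sqrt((u - v)^T cov^{-1} (u - v))."""
    diff = np.asarray(u, float) - np.asarray(v, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Demo: with identity covariance the distance reduces to Euclidean; a larger
# variance in one dimension shrinks that dimension's contribution.
d_eucl = mahalanobis([1.0, 0.0], [0.0, 0.0], np.eye(2))
d_scaled = mahalanobis([1.0, 0.0], [0.0, 0.0], np.diag([4.0, 1.0]))
```

Weighting by the feature covariance is what makes the metric comparable across heterogeneous HNM parameter dimensions.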
Line Spectral Pairs Based Voice Conversion using Radial Basis Function - IDES Editor
Voice Conversion (VC) is a technique which morphs the speaker-dependent acoustical cues of the source speaker to those of the target speaker. Speaker-dependent acoustical cues are characterized at different levels, such as the shape of the vocal tract and the glottal excitation. In this paper, vocal tract parameters and glottal excitations are characterized using Line Spectral Pairs (LSP) and the pitch residual, respectively. The strong generalization ability of the Radial Basis Function (RBF) is utilized to map the acoustical cues, namely the LSP and pitch residual, of the source speaker to those of the target speaker. Subjective and objective measures are used to evaluate the comparative performance of the RBF and state-of-the-art GMM based voice conversion systems. Objective measures and simulation results indicate that the RBF transformation model performs better than the GMM model, and subjective evaluations illustrate that the proposed algorithm maintains target voice individuality, naturalness, and quality of the speech signal.
FIR Filter Design using Particle Swarm Optimization with Constriction Factor ...IDES Editor
This paper presents an alternative approach for the design of linear-phase digital low-pass FIR filters using Particle Swarm Optimization with Constriction Factor and Inertia Weight Approach (PSO-CFIWA). FIR filter design is a multimodal optimization problem, and conventional gradient-based optimization techniques are not efficient for digital filter design. Given the filter specification to be realized, the PSO algorithm generates a set of filter coefficients and tries to meet the ideal frequency characteristic. In this paper, FIR filters of different orders are realized for the given problem. The simulation results have been compared with a well-accepted evolutionary algorithm, the genetic algorithm (GA). The results show that the proposed PSO-CFIWA filter design approach outperforms GA not only in the accuracy of the designed filter but also in convergence speed and solution quality.
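A minimal numpy sketch of the constriction-factor PSO variant applied to FIR design follows. The band edges, swarm size, and weighted least-squares cost are my illustrative choices, not the paper's specification, and the inertia-weight component of PSO-CFIWA is omitted for brevity; only Clerc's constriction factor (chi ≈ 0.7298 for c1 = c2 = 2.05) is shown.

```python
import numpy as np

def fir_response(h, w):
    """Magnitude response of FIR taps h on a grid of angular frequencies w."""
    n = np.arange(len(h))
    return np.abs(np.exp(-1j * np.outer(w, n)) @ h)

def pso_fir(num_taps=11, swarm=30, iters=200, seed=0):
    """Sketch of constriction-factor PSO for a low-pass FIR design.
    Illustrative ideal response: passband [0, 0.2*pi], stopband [0.5*pi, pi];
    the transition band is not penalized."""
    rng = np.random.default_rng(seed)
    w = np.linspace(0, np.pi, 64)
    ideal = (w <= 0.2 * np.pi).astype(float)
    weight = np.where((w > 0.2 * np.pi) & (w < 0.5 * np.pi), 0.0, 1.0)

    def cost(h):
        return np.sum(weight * (fir_response(h, w) - ideal) ** 2)

    # Clerc's constriction factor for c1 = c2 = 2.05 (phi = 4.1).
    phi = 4.1
    chi = 2.0 / abs(2.0 - phi - np.sqrt(phi * phi - 4.0 * phi))
    pos = rng.uniform(-0.5, 0.5, (swarm, num_taps))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([cost(p) for p in pos])
    g = pbest[np.argmin(pbest_cost)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = chi * (vel + 2.05 * r1 * (pbest - pos) + 2.05 * r2 * (g - pos))
        pos += vel
        costs = np.array([cost(p) for p in pos])
        better = costs < pbest_cost
        pbest[better] = pos[better]
        pbest_cost[better] = costs[better]
        g = pbest[np.argmin(pbest_cost)].copy()
    return g, cost(g)

h_opt, final_cost = pso_fir()
```

Because the cost is evaluated directly on the magnitude response, no gradient is needed, which is exactly the property that motivates swarm methods for this multimodal design problem.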
Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction - IOSRJVSP
This paper aims to reduce background noise introduced into the speech signal during capture, storage, transmission, and processing, using a spectral subtraction algorithm. To account for the fact that colored noise corrupts the speech signal non-uniformly over different frequency bands, the Multi-Band Spectral Subtraction (MBSS) approach is exploited, wherein the amount of noise subtracted from the noisy speech signal is decided by a weighting factor; the choice of optimal weights decides the performance of the speech enhancement system. In this paper, the weights are decided based on the SFM (Spectral Flatness Measure) rather than the conventional SNR (Signal to Noise Ratio) based rule, since the SFM provides a true distinction between the speech and noise signals. Spectrograms and Mean Opinion Scores show that speech enhanced by the proposed SFM-based MBSS possesses better perceptual quality and improved intelligibility than the existing SNR-based MBSS.
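The spectral flatness measure that drives the band weighting above is simply the ratio of the geometric to the arithmetic mean of the power spectrum; a minimal numpy sketch (the frame signals are illustrative assumptions):

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """SFM: geometric mean over arithmetic mean of the power spectrum.
    Near 1 for noise-like (flat) spectra, near 0 for tonal, speech-like ones."""
    p = np.abs(np.fft.rfft(frame)) ** 2 + eps
    return np.exp(np.mean(np.log(p))) / np.mean(p)

# Demo: white noise should score much flatter than a pure tone.
rng = np.random.default_rng(0)
noise_sfm = spectral_flatness(rng.normal(0, 1, 1024))
tone_sfm = spectral_flatness(np.sin(2 * np.pi * 100 * np.arange(1024) / 8000))
```

By the AM-GM inequality the measure never exceeds 1, so it provides a bounded, SNR-free cue for deciding how aggressively each band should be subtracted.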
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas...ijceronline
Speech Recognition Using HMM with MFCC-An Analysis Using Frequency Specral De...sipij
This paper presents an approach to speech signal recognition using frequency spectral information with Mel frequency, to improve speech feature representation in an HMM-based recognition approach. Frequency spectral information is incorporated into the conventional Mel-spectrum-based speech recognition approach. The Mel frequency approach exploits the frequency observations for the speech signal at a given resolution, which results in overlapping resolution features that limit recognition. Resolution decomposition with frequency separation is the mapping approach used for the HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy.
Speech Enhancement for Nonstationary Noise Environmentssipij
In this paper, we present a simultaneous detection and estimation approach for speech enhancement in nonstationary noise environments. A detector for speech presence in the short-time Fourier transform domain is combined with an estimator, which jointly minimizes a cost function that takes into account both detection and estimation errors. Under speech presence, the cost is proportional to a quadratic spectral amplitude error, while under speech absence, the distortion depends on a certain attenuation factor. Experimental results demonstrate the advantage of the proposed simultaneous detection and estimation approach, which facilitates suppression of nonstationary noise with a controlled level of speech distortion.
This paper proposes a voice morphing system for people suffering from laryngectomy, the surgical removal of all or part of the larynx or voice box, particularly performed in cases of laryngeal cancer. A primitive method of achieving voice morphing is to extract the source's vocal coefficients and convert them into the target speaker's vocal parameters. In this paper, we deploy Gaussian Mixture Models (GMM) for mapping the coefficients from source to destination. However, the traditional GMM-based mapping approach results in over-smoothing of the converted voice. We therefore propose a unique GMM-based method for efficient voice morphing and conversion that overcomes the over-smoothing effects of the traditional method. It uses glottal waveform separation and prediction of excitations, and the results show that not only is over-smoothing eliminated but the transformed vocal tract parameters also match the target. Moreover, the synthesized speech thus obtained is of sufficiently high quality. Voice morphing based on this unique GMM approach is proposed and critically evaluated using various subjective and objective evaluation parameters, and an application of voice morphing for laryngectomees deploying this approach is recommended.
This paper contains a report on an Audio-Visual Client Recognition System using Matlab software which identifies five clients and can be improved to identify as many clients as possible depending on the number of clients it is trained to identify which was successfully implemented. The implementation was accomplished first by visual recognition system implemented using The Principal Component Analysis, Linear Discriminant Analysis and Nearest Neighbour Classifier. A successful implementation of second part was achieved by audio recognition using Mel-Frequency Cepstrum Coefficient, Linear Discriminant Analysis and Nearest Neighbour Classifier the system was tested using images and sounds that have not been trained to the system to see whether it can detect an intruder which lead us to a very successful result with précised response to intruder.
High Quality Arabic Concatenative Speech Synthesissipij
This paper describes the implementation of TD-PSOLA tools to improve the quality of the Arabic Text-tospeech (TTS) system. This system based on Diphone concatenation with TD-PSOLA modifier synthesizer. This paper describes techniques to improve the precision of prosodic modifications in the Arabic speech synthesis using the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method. This approach is based on the decomposition of the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modifications of the speech signal and diphone with vowel integrated database adjustment and optimisation.
Comparison and Analysis of LDM and LMS for an Application of a Speech (CSCJournals)
Most automatic speech recognition (ASR) systems are based on Gaussian mixture models, whose outputs depend on subphone states. We often measure and transform the speech signal into another form to enhance our ability to communicate. Speech recognition is the conversion of an acoustic waveform into its written equivalent. The nature of the speech recognition problem depends heavily on the constraints placed on the speaker, the speaking situation, and the message context. Various speech recognition systems are available; the best models are those that capture the hidden states of speech. LMS is one of the simplest algorithms used to reconstruct speech, and the linear dynamic model (LDM) is used to recognize speech in noisy environments. This paper analyses and compares the LDM and a simple LMS algorithm for speech recognition.
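A minimal sketch of the LMS recursion the abstract refers to, applied to a toy system-identification task; the signals, tap count, and step size here are illustrative and not taken from the paper.

```python
import random

def lms_identify(x, d, n_taps=2, mu=0.1, n_passes=1):
    """Adapt FIR weights w so that w . x[k..] approximates the desired signal d."""
    w = [0.0] * n_taps
    for _ in range(n_passes):
        for k in range(n_taps - 1, len(x)):
            frame = x[k - n_taps + 1:k + 1][::-1]       # most recent sample first
            y = sum(wi * xi for wi, xi in zip(w, frame))
            e = d[k] - y                                # instantaneous error
            w = [wi + mu * e * xi for wi, xi in zip(w, frame)]  # LMS update
    return w

# Unknown 2-tap system h = [0.5, -0.3] driven by white input
random.seed(0)
x = [random.uniform(-1, 1) for _ in range(500)]
d = [0.5 * x[k] + (-0.3 * x[k - 1] if k > 0 else 0.0) for k in range(len(x))]
w = lms_identify(x, d, n_taps=2, mu=0.1, n_passes=5)
```

With a noise-free desired signal the weights converge to the true system taps.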
A Novel Uncertainty Parameter SR (Signal to Residual Spectrum Ratio) Evalua... (sipij)
Hearing impaired people usually use hearing aids which implement speech enhancement algorithms. Estimation of speech and estimation of noise are the two components of a single-channel speech enhancement system, and the main challenge for any such algorithm is estimating the noise power spectrum in non-stationary environments. A VAD (Voice Activity Detector) is used to identify speech pauses, and noise is estimated only during these pauses. The MMSE (Minimum Mean Square Error) speech enhancement algorithm does not improve intelligibility, quality, or listener fatigue, which are the perceptual aspects of speech. A novel evaluation approach, SR (Signal to Residual spectrum ratio), based on an uncertainty parameter, is introduced to control distortions for the benefit of hearing impaired people in non-stationary environments. Noise is estimated and updated by dividing the original signal into pure-speech, quasi-speech, and non-speech frames based on multiple threshold conditions. Different values of SR and LLR indicate the amount of attenuation and amplification distortion. The proposed method is compared with the WAT (Weighted Average Technique) and MMSE methods using the SR, LLR (log-likelihood ratio), and segmental SNR parameters.
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First generation techniques include formant synthesis and articulatory synthesis. Formant synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. Articulatory synthesis produces speech by directly modeling human articulator behavior. Second generation techniques comprise concatenative synthesis and sinusoidal synthesis. Concatenative synthesis generates speech output by concatenating segments of recorded speech, and generally produces natural-sounding synthesized speech. Sinusoidal synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency; the model parameters are the amplitudes and periods of the harmonics, with which the fundamental can be changed while keeping the same basic spectral envelope. Third generation techniques include Hidden Markov Model (HMM) synthesis and unit selection synthesis. HMM synthesis trains a parametric model and produces high-quality speech. Finally, unit selection operates by selecting, from a large speech database, the best sequence of units matching the specification.
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R... (IJERA Editor)
This paper presents a compressive sensing technique for speech reconstruction using linear predictive coding, since speech is more sparse in the LPC domain. The DCT of the speech is taken and DCT points of the sparse speech are discarded arbitrarily, by setting some points in the DCT domain to zero through multiplication with mask functions. From the incomplete points in the DCT domain, the original speech is reconstructed using compressive sensing, with Gradient Projection for Sparse Reconstruction as the tool. The performance is compared subjectively with direct IDCT reconstruction, and the experiments show that compressive sensing performs better than the DCT alone.
General Kalman Filter & Speech Enhancement for Speaker Identification (ijcisjournal)
The presence of noise increases the dimension of the information. A noise suppression algorithm is developed by combining the General Kalman Filter with the Expectation Maximization (EM) framework. This combination is effective in identifying the noise characteristics of an acoustic environment, and the recursion between the expectation step and the maximization step enables the algorithm to deal with any noise model. The same speech enhancement procedure is applied in the pre-processing stage of a conventional speaker identification method. Because of the non-stationary nature of noise and speech, adaptive algorithms are required. The algorithm is first applied to the speech enhancement problem and then extended to the pre-processing step of speaker identification. The present work is compared, in terms of significant metrics, with existing and popular algorithms, and the results show that the developed algorithm dominates them.
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS (ijnlc)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signal. HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences; it locates an optimal match between them, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded; the recorded material was segmented manually and aligned at the sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
Comparative study of voice print based acoustic features MFCC and LPCC (IJAEMSJORNAL)
Voice is the best biometric feature for investigation and authentication; it has both biological and behavioural characteristics, and the acoustic features are related to the voice. A Speaker Recognition System is designed for automatic authentication of a speaker's identity based purely on the human voice. Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC) are used for feature extraction from the provided voice sample. This paper provides a comparative study of MFCC and LPCC based on the accuracy of results and their working methodology; the results are better when MFCC is used for feature extraction.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy in some other task such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is mostly used
for aligning two given multidimensional sequences. It finds an optimal match between the given sequences.
The distance between the aligned sequences should be smaller than between unaligned
sequences. The improvement in the alignment may be estimated from the corresponding distances. This
technique has applications in speech recognition, speech synthesis, and speaker transformation. The
objective of this research is to investigate the amount of improvement in the alignment corresponding to the
sentence based and phoneme based manually aligned phrases. The speech signals in the form of twenty five
phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was
segmented manually and aligned at sentence and phoneme level. The aligned sentences of different speaker
pairs were analyzed using HNM and the HNM parameters were further aligned at frame level using DTW.
Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than
20 % reduction in the average Mahalanobis distances.
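The frame-level DTW alignment described above can be sketched with the classic dynamic-programming recurrence. This toy version uses scalar frames and absolute difference as the local cost, instead of the paper's multidimensional HNM parameter frames and Mahalanobis distances; all sequences are illustrative.

```python
def dtw_distance(a, b):
    """Cumulative cost of the optimal monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])           # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],   # insertion
                                 cost[i][j - 1],   # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# A time-stretched copy aligns almost perfectly; an unrelated sequence does not.
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 2, 3, 3, 2, 1, 0]   # same shape as a, warped in time
c = [3, 3, 3, 0, 0, 0, 3]
```

The warped copy reaches zero cumulative distance, which is exactly the "reduction in distance after alignment" the paper measures.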
Line Spectral Pairs Based Voice Conversion using Radial Basis Function (IDES Editor)
Voice Conversion (VC) is a technique which morphs
the speaker dependent acoustical cues of the source speaker
to those of the target speaker. Speaker dependent acoustical
cues are characterized at different levels such as shape of
vocal tract and glottal excitation. In this paper, vocal tract
parameters and glottal excitations are characterized using
Line Spectral Pairs (LSP) and pitch residual respectively.
Strong generalization ability of Radial Basis Function (RBF)
is utilized to map the acoustical cues namely, LSP and pitch
residual of source speaker to that of target speaker. The
subjective and objective measures are used to evaluate the
comparative performance of RBF and state of the art GMM
based voice conversion system. Objective measures and
simulation results indicate that the RBF transformation model
performed better than GMM model. Subjective evaluations
illustrate that the proposed algorithm maintains target voice
individuality, naturalness and quality of the speech signal.
FIR Filter Design using Particle Swarm Optimization with Constriction Factor ... (IDES Editor)
This paper presents an alternative approach for the
design of linear phase digital low pass FIR filter using Particle
Swarm Optimization with Constriction Factor and Inertia
Weight Approach (PSO-CFIWA). FIR filter design is a multimodal
optimization problem. The conventional gradient based
optimization techniques are not efficient for digital filter
design. Given the filter specification to be realized, PSO
algorithm generates a set of filter coefficients and tries to
meet the ideal frequency characteristic. In this paper, for the
given problem, the realization of the FIR filters of different
order has been performed. The simulation results have been
compared with those of a well-accepted evolutionary algorithm,
the genetic algorithm (GA). The results show that the proposed
filter design approach using PSO-CFIWA outperforms GA, not
only in the accuracy of the designed filter but also in the
convergence speed and solution quality.
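A sketch of PSO-driven linear-phase low-pass FIR design in the spirit of the abstract. Note the hedges: this uses plain inertia-weight PSO (the paper's constriction-factor variant changes the velocity update), a tiny 7-tap filter, and illustrative band edges; the cost is the squared error of the symmetric filter's amplitude response against an ideal low-pass with a don't-care transition band.

```python
import math, random
random.seed(2)

N_TAPS = 7                      # odd length, symmetric coefficients -> linear phase
M = N_TAPS // 2
GRID = [math.pi * k / 63 for k in range(64)]

def amplitude(h_half, w):
    # h_half = [h[M], h[M-1], ..., h[0]] for a symmetric filter
    return h_half[0] + 2.0 * sum(h_half[k] * math.cos(k * w) for k in range(1, M + 1))

def cost(h_half):
    # squared error against an ideal low-pass: 1 below 0.3*pi, 0 above 0.5*pi
    err = 0.0
    for w in GRID:
        if w <= 0.3 * math.pi:
            err += (amplitude(h_half, w) - 1.0) ** 2
        elif w >= 0.5 * math.pi:
            err += amplitude(h_half, w) ** 2
    return err

def pso(n_particles=25, n_iter=300, w_in=0.7, c1=1.4, c2=1.4):
    dim = M + 1
    xs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pcost = [cost(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pcost[i])
    gbest, gcost = pbest[g][:], pcost[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                vs[i][d] = (w_in * vs[i][d]
                            + c1 * random.random() * (pbest[i][d] - xs[i][d])
                            + c2 * random.random() * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            c = cost(xs[i])
            if c < pcost[i]:
                pbest[i], pcost[i] = xs[i][:], c
                if c < gcost:
                    gbest, gcost = xs[i][:], c
    return gbest, gcost

best, best_cost = pso()
```

Because the amplitude response is linear in the coefficients, this particular cost surface is convex, so the swarm settles close to the least-squares design.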
Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction (IOSRJVSP)
This paper aims to reduce background noise introduced into the speech signal during capture, storage, transmission, and processing, using a spectral subtraction algorithm. To account for the fact that colored noise corrupts the speech signal non-uniformly over different frequency bands, the Multi-Band Spectral Subtraction (MBSS) approach is exploited, in which the amount of noise subtracted from the noisy speech signal is decided by a weighting factor. The choice of optimal weights determines the performance of the speech enhancement system. In this paper the weights are decided based on the SFM (Spectral Flatness Measure) rather than the conventional SNR (Signal to Noise Ratio) based rule, since the SFM provides a true distinction between speech and noise. Spectrograms and Mean Opinion Scores show that speech enhanced by the proposed SFM-based MBSS possesses better perceptual quality and improved intelligibility than the existing SNR-based MBSS.
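The spectral flatness measure behind that weighting rule is simple to compute: the ratio of the geometric to the arithmetic mean of the power spectrum. A minimal sketch (the band spectra below are hand-made stand-ins, not real speech frames):

```python
import math

def spectral_flatness(power_spectrum):
    """Geometric mean over arithmetic mean: near 1 for noise-like (flat)
    spectra, near 0 for tonal, speech-like spectra."""
    eps = 1e-12
    n = len(power_spectrum)
    log_gm = sum(math.log(p + eps) for p in power_spectrum) / n
    am = sum(power_spectrum) / n
    return math.exp(log_gm) / (am + eps)

flat = [1.0] * 32        # white-noise-like band: uniform power
tonal = [0.01] * 32
tonal[5] = 10.0          # one strong spectral peak, voiced-speech-like
```

This separation (flat spectra score near 1, tonal ones near 0) is what lets SFM drive per-band subtraction weights without an SNR estimate.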
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas... (ijceronline)
Speech Recognition Using HMM with MFCC - An Analysis Using Frequency Spectral De... (sipij)
This paper presents an approach to speech recognition that uses frequency spectral information together with the Mel frequency scale to improve speech feature representation in an HMM-based recognition system. Frequency spectral information is incorporated into the conventional Mel-spectrum-based approach. The Mel frequency approach observes the speech signal at a fixed resolution, which causes overlapping of resolution features and thereby limits recognition. Resolution decomposition with frequency separation is used as the mapping approach for the HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy.
Speech Enhancement for Nonstationary Noise Environments (sipij)
In this paper, we present a simultaneous detection and estimation approach for speech enhancement in nonstationary noise environments. A detector for speech presence in the short-time Fourier transform domain is combined with an estimator which jointly minimizes a cost function that takes into account both detection and estimation errors. Under speech presence, the cost is proportional to a quadratic spectral amplitude error, while under speech absence, the distortion depends on a certain attenuation factor. Experimental results demonstrate the advantage of the proposed simultaneous detection and estimation approach, which facilitates suppression of nonstationary noise with a controlled level of speech distortion.
LPC Models and Different Speech Enhancement Techniques - A Review (ijiert bestjournal)
The author has already published one review paper on enhancing the quality of a speech signal by minimizing the noise; this is the second paper of the same series. In the last two decades researchers have made continuous efforts to reduce the noise in speech signals. This paper comments on the various studies carried out and the analysis proposals of researchers for enhancement of speech quality. Various models, coding schemes, speech quality improvement methods, speaker-dependent codebooks, autocorrelation subtraction, speech restoration, speech production at low bit rates, and compression and enhancement are the various aspects of speech enhancement. We present a review of all the above-mentioned technologies in this paper and intend to examine some of these techniques, in order to analyze the factors affecting them, in an upcoming paper of the series.
Broad phoneme classification using signal based features (ijsc)
Speech is the most efficient and popular means of human communication. Speech is produced as a sequence
of phonemes. Phoneme recognition is the first step performed by an automatic speech recognition system. The
state-of-the-art recognizers use mel-frequency cepstral coefficients (MFCC) features derived through short
time analysis, for which the recognition accuracy is limited. Instead of this, here broad phoneme
classification is achieved using features derived directly from the speech at the signal level itself. Broad
phoneme classes include vowels, nasals, fricatives, stops, approximants and silence. The features identified
useful for broad phoneme classification are voiced/unvoiced decision, zero crossing rate (ZCR), short time
energy, most dominant frequency, energy in most dominant frequency, spectral flatness measure and first
three formants. Features derived from short time frames of training speech are used to train a multilayer
feedforward neural network based classifier with manually marked class label as output and classification
accuracy is then tested. Later this broad phoneme classifier is used for broad syllable structure prediction
which is useful for applications such as automatic speech recognition and automatic language
identification.
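Two of the signal-level features listed above can be sketched directly: the zero crossing rate (typically high for fricatives) and the short-time energy (typically high for vowels). The frames below are synthetic stand-ins for real speech, and the frame size is an assumption (20 ms at 8 kHz).

```python
import math, random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum(1 for a, b in zip(frame, frame[1:])
               if (a >= 0) != (b >= 0)) / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude over the frame."""
    return sum(x * x for x in frame) / len(frame)

n = 160                                      # 20 ms frame at 8 kHz
vowel_like = [math.sin(2 * math.pi * 120 * k / 8000) for k in range(n)]  # low f0 tone
random.seed(3)
fric_like = [random.uniform(-0.1, 0.1) for _ in range(n)]  # weak noise burst
```

The two features separate the frames in opposite directions, which is exactly what a broad-class classifier exploits.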
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp... (CSCJournals)
In this paper a new thresholding based speech enhancement approach is presented, where the threshold is statistically determined by employing the Teager energy operation on the Wavelet Packet (WP) coefficients of noisy speech. The threshold thus obtained is applied on the WP coefficients of the noisy speech by using a hard thresholding function in order to obtain an enhanced speech. Detailed simulations are carried out in the presence of white, car, pink, and babble noises to evaluate the performance of the proposed method. Standard objective measures, spectrogram representations and subjective listening tests show that the proposed method outperforms the existing state-of-the-art thresholding based speech enhancement approaches for noisy speech from high to low levels of SNR.
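The Teager energy operator at the heart of this thresholding scheme is a three-point operator, psi[x](n) = x(n)^2 - x(n-1)*x(n+1). The paper applies it to wavelet packet coefficients to derive the threshold statistically; the sketch below only shows the operator itself, on a raw sinusoid, where it has a known closed-form value.

```python
import math

def teager_energy(x):
    """Discrete Teager energy operator applied along a sequence."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For a pure sinusoid A*sin(w*n) the operator is exactly A^2 * sin(w)^2 at
# every sample (by the identity sin(a-w)*sin(a+w) = sin(a)^2 - sin(w)^2),
# i.e. it tracks both amplitude and frequency in a single constant.
A, w = 2.0, 0.3
sine = [A * math.sin(w * n) for n in range(100)]
psi = teager_energy(sine)
```

That amplitude-and-frequency sensitivity is why Teager energy on subband coefficients gives a sharper noise/speech contrast than plain squared magnitude.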
Classification improvement of spoken arabic language based on radial basis fu... (IJECEIAES)
An important task in human-computer interaction is language recognition and classification. In the Arab world there is a persistent need for Arabic spoken language recognition, to help those who have lost their upper limbs do what they want through speech-based computer interaction; yet Arabic automatic speech recognition (AASR) has not received the desired attention from researchers. In this paper, the Radial Basis Function (RBF) is used to improve recognition of spoken Arabic language letters. The recognition and classification process is based on three steps: preprocessing, feature extraction, and classification (recognition). Arabic Language Letter (ALL) recognition is performed using a combination of statistical features and a temporal Radial Basis Function for different letter situations and noisy conditions. Recognition rates from 90% to 99.375% were obtained with independent speakers; these results outperform earlier works by nearly 2.045%. The simulation was carried out using Matlab 2015b.
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the phonetic distance between pronunciations, and a critical-distance threshold criterion for the classification. The proposed method consists of two steps: a training step that estimates the critical distance parameter using transcribed data, and a second step that uses this critical distance criterion to classify input utterances into pronunciation variants and OOV words. The algorithm is implemented in the Java language. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. The confusion matrix and the precision, recall, and accuracy metrics are used for performance evaluation. Experimental results show significant performance improvement over existing classifiers.
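The distance-plus-threshold idea can be sketched as an edit-distance dynamic program over phone strings with a critical-distance rule. This is a simplification: the paper's DPW and its trained critical distance may differ, and the phone sequences and threshold value below are illustrative, not from TIMIT or the CMU dictionary.

```python
def phone_distance(p, q):
    """Levenshtein distance between two phone sequences, normalized by length."""
    n, m = len(p), len(q)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m] / max(n, m)

def is_variant(p, q, critical=0.25):
    """Below the critical distance: same word, variant pronunciation."""
    return phone_distance(p, q) <= critical

tomato_a = ["t", "ah", "m", "ey", "t", "ow"]
tomato_b = ["t", "ah", "m", "aa", "t", "ow"]   # pronunciation variant
potato   = ["p", "ah", "t", "ey", "t", "ow"]   # different word
```

In the paper the critical distance is estimated from transcribed training data rather than fixed by hand as here.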
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ... (ijcsit)
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers from variation in context, speaker variability, and environmental conditions. In this paper we present a curvelet-based Feature Extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature vector size; curvelets offer better accuracy and a varying window size, which makes them suitable for non-stationary signals. For better word classification and recognition, a discrete hidden Markov model can be used, as it accounts for the time distribution of speech signals. The HMM classification method attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction and classification phases of a speech recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using the discrete curvelet transform over conventional methods.
Speaker recognition is the computing task of validating a user's claimed identity using characteristics extracted from their voice. Voice recognition combines speaker and speech recognition, using learned aspects of a speaker's voice to determine what is being said; such a system cannot recognize speech from random speakers very accurately, but it can reach high accuracy for the individual voices it has been trained on, which gives it various applications in day-to-day life.
Performance Analysis of Adaptive Noise Canceller Employing NLMS Algorithm (ijwmn)
In voice communication systems, noise cancellation using adaptive digital filters is a renowned technique for extracting the desired speech signal by eliminating noise from a speech signal corrupted by noise. In this paper, the performance of an adaptive noise canceller of Finite Impulse Response (FIR) type has been analysed employing the NLMS (Normalized Least Mean Square) algorithm. An extensive study has been made to investigate the effects of different parameters, such as the number of filter coefficients, number of samples, step size, and input noise level, on the performance of the adaptive noise cancelling system. All the results have been obtained using computer simulations built on the MATLAB platform.
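An NLMS adaptive noise canceller of this kind can be sketched briefly: a reference noise input is filtered to match the noise component in the primary (speech plus filtered noise) signal, and the error output is the cleaned speech. The scenario, tap count, and step size below are illustrative assumptions, not the paper's experimental setup.

```python
import math, random
random.seed(4)

def nlms_cancel(primary, reference, n_taps=4, mu=0.1, eps=1e-8):
    """Adaptive noise canceller: returns the error signal (cleaned speech)."""
    w = [0.0] * n_taps
    out = []
    for k in range(len(primary)):
        frame = [reference[k - i] if k - i >= 0 else 0.0 for i in range(n_taps)]
        y = sum(wi * xi for wi, xi in zip(w, frame))   # noise estimate
        e = primary[k] - y                             # cleaned-signal sample
        norm = eps + sum(xi * xi for xi in frame)      # input-power normalization
        w = [wi + (mu / norm) * e * xi for wi, xi in zip(w, frame)]  # NLMS step
        out.append(e)
    return out

n = 2000
speech = [0.5 * math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
noise = [random.uniform(-1, 1) for _ in range(n)]
# The noise reaches the primary microphone through a simple 2-tap path.
primary = [s + 0.8 * noise[k] - 0.4 * (noise[k - 1] if k > 0 else 0.0)
           for k, s in enumerate(speech)]
cleaned = nlms_cancel(primary, noise)
```

The step size `mu` trades convergence speed against residual misadjustment, which is the kind of parameter effect the paper studies.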
Isolated word recognition using LPC & vector quantization (eSAT Journals)
Abstract: Speech recognition has always been looked upon as a fascinating field in human-computer interaction, and is one of the fundamental steps towards understanding human recognition and behavior. This paper explicates the theory and implementation of speech recognition for a speaker-dependent, real-time, isolated word recognizer. The main approach was first to obtain feature vectors using LPC, followed by vector quantization; the quantized vectors were then recognized by measuring the minimum average distortion. All speech recognition systems contain two main phases, namely the training phase and the testing phase. In the training phase the features of the words are extracted, and during the recognition phase feature matching takes place: the feature templates extracted with LPC analysis are stored in a database, and during recognition the extracted features are compared with the templates in the database. Vector quantization is used for generating the codebooks, and the recognition decision is made based on the matching score. MATLAB is used to implement this concept to achieve further understanding. Index Terms: Speech Recognition, LPC, Vector Quantization, Code Book.
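The matching-score step can be sketched as follows: each word template is a small codebook, a test vector sequence is scored by its average distortion (squared distance to the nearest codeword), and the lowest-distortion codebook wins. The codebooks and vectors below are hand-made stand-ins for ones trained from LPC features with the LBG algorithm.

```python
def avg_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    total = 0.0
    for v in vectors:
        total += min(sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook)
    return total / len(vectors)

def recognize(vectors, codebooks):
    """Return the word whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda name: avg_distortion(vectors, codebooks[name]))

codebooks = {
    "yes": [(0.9, 0.1), (0.8, 0.2)],
    "no":  [(0.1, 0.9), (0.2, 0.8)],
}
utterance = [(0.85, 0.15), (0.9, 0.12), (0.78, 0.25)]   # lies near the "yes" codewords
```

In the full system the 2-D points would be LPC-derived feature vectors, one per analysis frame.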
METHOD FOR REDUCING NOISE BY IMPROVING SIGNAL-TO-NOISE RATIO IN WIRELESS LAN (IJNSA Journal)
The signal-to-noise ratio (SNR) is one of the important measures for noise reduction. A technique that uses a linear prediction error filter (LPEF) and an adaptive digital filter (ADF) to achieve noise reduction in speech and images degraded by additive background noise is proposed. Since a speech signal can be represented as a stationary signal over a short interval of time, most of the speech signal can be predicted by the LPEF; this estimation is performed by the ADF, which is used for system identification. Noise reduction is achieved by subtracting the reconstructed noise from the speech degraded by additive background noise. Most MR image acceleration methods suffer from degradation of the acquired images, often correlated with the degree of acceleration; Wideband MRI, however, is a novel technique that transcends such flaws. In this paper we propose the LPEF and ADF for reducing noise in speech, and we also demonstrate that Wideband MRI can obtain images with quality identical to conventional MR images in terms of SNR over wireless LAN.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Designing Great Products: The Power of Design and Leadership by Chief Designe...
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOGNITION
Signal & Image Processing : An International Journal (SIPIJ) Vol.8, No.1, February 2017
DOI : 10.5121/sipij.2017.8103
ROBUST FEATURE EXTRACTION USING
AUTOCORRELATION DOMAIN FOR NOISY
SPEECH RECOGNITION
Gholamreza Farahani
Department of Electrical Engineering and Information Technology, Iranian Research
Organization for Science and Technology (IROST), Tehran, Iran
ABSTRACT
Previous research has found the autocorrelation domain to be an appropriate domain for signal and noise
separation. This paper discusses a simple and effective method for decreasing the effect of noise on the
autocorrelation of the clean signal, which can later be used in extracting mel cepstral parameters for
speech recognition. Two different methods are proposed to deal with the error introduced by
considering speech and noise completely uncorrelated. The basic approach reduces the effect
of noise via estimation and subtraction of its effect from the noisy speech signal autocorrelation. To
improve this method, we consider inserting a speech/noise cross correlation term into the equations used
for the estimation of the clean speech autocorrelation, using an estimate of it found through the Kernel
method. Alternatively, we use an estimate of the cross correlation term obtained with an averaging
approach. A further improvement is obtained through the introduction of an overestimation parameter in
the basic method. We tested our proposed methods on the Aurora 2 task. The basic method has shown
considerable improvement over the standard features and some other robust autocorrelation-based
features. The proposed techniques have further increased the robustness of the basic
autocorrelation-based method.
KEYWORDS
Autocorrelation, Noise, Robustness, Speech Recognition
1. INTRODUCTION
Mismatch between speech data collected in a laboratory environment, usually used in system
training, and speech collected in real environments is known to be among the major reasons for
speech recognizer performance degradation. Various sources may cause such a mismatch,
including additive background noise, convolutional channel distortions, acoustic echo and
different interfering signals. In this paper, additive background noise is our major concern.
To improve the recognition performance in the presence of additive noise, several approaches
have been proposed during the past few decades. While these methods can be very roughly
classified into model-based and feature-based, we are more interested in feature-based robust
speech recognition.
If one aims to appropriately handle mismatches in the features, one may either try to improve the
signal quality before extracting recognition features or try to develop features that are
more robust to noise. The first approach is usually known as speech enhancement and is usually
dealt with separately from the issue of speech recognition. There are many techniques proposed
to solve the speech enhancement problem, most of which concentrate on the spectral domain. On
the other hand, several approaches try to extract more noise-robust features for speech
recognition. Such methods try to improve recognition performance in comparison to the rather
standard features, Mel-Frequency Cepstral Coefficients (MFCC) that have shown good
performance in clean-train/clean-test conditions, but deteriorated performance in the cases of
mismatch. A very well-known and widely used enhancement method that deals with the signal
spectrum is Spectral Subtraction (SS) [1]. Although spectral subtraction is simple to implement,
only limited success has been observed from its use in combination with speech recognizers.
Inherent errors in this approach, such as phase, magnitude and cross-term errors [2], can lead to
performance limitations in enhancement. However, when used in combination with speech
recognition systems, some of these errors can be disregarded. Meanwhile, some other enhancement
methods have been able to achieve better performance when used in combination with speech
recognizers.
Plenty of research work has been dedicated to extraction of more robust features for speech
recognition. One approach, that we are particularly interested in, and has shown some degrees of
success in recent works, is the use of autocorrelation in the feature extraction process.
Autocorrelation, among its different properties, is known to have a pole preserving property [3].
As an example, if the original signal is modeled by an all-pole sequence, the poles of the
autocorrelation sequence will be the same as those of the original signal. Therefore, there exists a
possibility of replacing features extracted from the original speech signal with those extracted
from its autocorrelation sequence. Consequently, any effort resulting in an improved
autocorrelation sequence in the presence of noise could also be helpful in finding more
appropriate speech features.
The autocorrelation domain is useful in different speech-related tasks. In reference [4],
different methods of separating the voiced and unvoiced segments of a speech signal, based on
short-time energy, short-time magnitude and zero crossing rate calculations over the
autocorrelations of different segments of the signal, are introduced. Pitch detection
algorithms (PDA) for simple audio signals, based on the zero-crossing rate (ZCR) and the
autocorrelation function (ACF), are presented in reference [5].
Several methods have been reported in autocorrelation domain, leading to more robust sets of
features. These methods may be divided into two groups: one dealing with the magnitude of the
autocorrelation sequence whilst the other works on the phase of the autocorrelation sequence.
Dealing with the magnitude of the autocorrelation sequence, which is our concern in this paper,
among the most successful methods, we can name Differentiated Relative Autocorrelation
Sequence Spectrum (DRASS) [6], Short-time Modified Coherence (SMC) [7], One-Sided
Autocorrelation LPC (OSALPC) [8], Relative Autocorrelation Sequence (RAS) [9],
Autocorrelation Mel Frequency Cepstral Coefficients (AMFCC) [10] and Differentiation of
Autocorrelation Sequence (DAS) [11]. Also, it has been shown that the use of spectral peaks
obtained from a filtered autocorrelation sequence can lead to a good performance under noisy
conditions [12, 13].
In DRASS, the autocorrelation is calculated with a biased estimator after frame blocking and
pre-emphasis. Then, after filtering, the FFT is computed, and the absolute amplitude of the
differentiated squared FFT amplitude is applied to a Mel-scale frequency filter bank. Finally,
the logarithms of the filter-bank outputs and their cepstrum are used as the DRASS coefficients.
In SMC, after calculation of the autocorrelation with coherence estimation and Hamming filtering,
the FFT of the autocorrelation amplitude is found. Then, applying the IFFT, the LPC coefficients
are calculated with the Levinson-Durbin method, and finally the cepstrum of the LPC coefficients
is used as the SMC coefficients. In OSALPC, the autocorrelation is calculated with a biased
estimator and Hamming filtering, the LPC coefficients are computed using the Levinson-Durbin
method, and the LPC cepstral coefficients are used as the final features. Among methods that have
made use of the phase of the autocorrelation sequence to obtain a more robust set of features, we
can name the Phase AutoCorrelation (PAC) approach [14] and Autocorrelation Peaks and Phase
features (APP) [13].
In this paper, we will consider a few developed autocorrelation-based methods and discuss their
approach to achieving robustness. Then we will explain a simple method that can lead to better
results in robust speech recognition in comparison to its predecessors in autocorrelation domain.
Later, we will discuss the issue of the error terms introduced in this approach due to the
estimation of the noise autocorrelation sequence. We will show that taking these error terms into
account in the estimation of the clean signal autocorrelation sequence can lead to even better
system performance.
The remainder of this paper is organized as follows. In Section 2 we will present the theory
behind some autocorrelation-based approaches. Section 3 is dedicated to discussion on our
Autocorrelation-based Noise Subtraction (ANS) approach and its derivatives. In Section 4, some
implementation issues regarding the proposed methods will be addressed. Section 5 includes the
experimental results on Aurora 2 task and compares our results with those of the traditional
methods such as MFCC and other autocorrelation-based methods. Section 6 will conclude the
paper.
2. REVIEW OF THE AUTOCORRELATION-BASED METHODS
In this section, we will describe a few methods in autocorrelation domain. This will give us an
appropriate insight to the advantages and disadvantages of using autocorrelation in robust feature
extraction.
2.1. Formulation of Clean and Noisy Speech Signals and Noise in Autocorrelation Domain
We start by explaining the relationship between the autocorrelation sequences of the clean and
noisy signals and noise. Assuming v(m,n) to be the additive noise and x(m,n) the clean speech
signal, the noisy speech signal, y(m,n), can be written as

y(m,n) = x(m,n) + v(m,n), 0 ≤ n ≤ N−1, 0 ≤ m ≤ M−1, (1)

where N is the frame length, n is the discrete time index within a frame, m is the frame index and
M is the number of frames. Note that in this paper, as our goal is the suppression of the effect of
additive noise on the noisy speech signal, channel effects are not considered. If x(m,n) and v(m,n)
are considered uncorrelated, then the autocorrelation of the noisy speech signal can be written as

ryy(m,k) = rxx(m,k) + rvv(m,k), (2)
4. Signal & Image Processing : An International Journal (SIPIJ) Vol.8, No.1, February 2017
26
where ryy(m,k), rxx(m,k) and rvv(m,k) are the short-time one-sided autocorrelation sequences of
the noisy speech, clean speech and noise respectively, and k is the autocorrelation sequence index
within each frame. The one-sided autocorrelation sequence of the noisy speech signal may be
calculated using an unbiased estimator, i.e.

ryy(m,k) = (1/(N−k)) · Σ_{n=0}^{N−1−k} y(m,n) y(m,n+k), 0 ≤ k ≤ N−1. (3)
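For illustration, the unbiased estimator above can be sketched in a few lines of Python (the function name and array handling are ours, not the paper's):

```python
import numpy as np

def unbiased_autocorr(frame, max_lag):
    """One-sided unbiased autocorrelation of a single frame:
    r(k) = 1/(N-k) * sum_{n=0}^{N-1-k} y(n) * y(n+k)."""
    N = len(frame)
    r = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        # overlap of the frame with itself shifted by k samples
        r[k] = np.dot(frame[:N - k], frame[k:]) / (N - k)
    return r

# A constant frame of ones has autocorrelation 1 at every lag
# with this estimator, since the 1/(N-k) factor removes the bias.
print(unbiased_autocorr(np.ones(8), 3))
```

The 1/(N−k) normalization is what makes the estimator unbiased; the more common biased estimator divides by N for all lags.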
Meanwhile, although reasonable in practice, considering the clean speech signal, x(m,n), and
noise, v(m,n), completely uncorrelated may not always be an accurate assumption. We will
discuss this issue later. In a more general case, equation (2) should be written as

ryy(m,k) = rxx(m,k) + rxv(m,k) + rvx(m,k) + rvv(m,k). (4)
If the noise autocorrelation sequence is assumed relatively constant across frames, we can find an
estimate of rvv(m,k) using the non-speech sections of an utterance, specified, for example, by a
voice activity detector (VAD) or the initial, normally non-speech, periods, and denote it as
r̂vv(k). Then we will have

r̂xx(m,k) = ryy(m,k) − rxv(m,k) − rvx(m,k) − r̂vv(k). (5)

Obviously, an assumption of v(m,n) having zero mean and being uncorrelated with x(m,n) will
reduce the terms rxv(m,k) and rvx(m,k) to zero [15].
2.2. Autocorrelation-based Methods for Robust Feature Extraction
Recently, several autocorrelation-based methods have been proposed in which, usually, the speech
signal and noise are considered uncorrelated. We will describe some of these methods here, in
order to get some insight into how autocorrelation properties may be used to achieve robustness.
2.2.1. Relative Autocorrelation Sequence (RAS)
As explained in reference [9], this method assumed the noise to be stationary and uncorrelated
with the speech signal. Therefore, the relationship between the autocorrelations of the noisy and
clean signals and noise could be written as

ryy(m,k) = rxx(m,k) + rvv(k). (6)

If the noise part could be considered stationary, differentiating both sides of equation (6) with
respect to the frame index m would remove the effect of noise from the result, i.e.

∂rxx(m,k)/∂m = ∂ryy(m,k)/∂m ≈ ( Σ_{l=−L}^{L} l · ryy(m+l,k) ) / ( Σ_{l=−L}^{L} l² ). (7)
The right side of equation (7) is equivalent to filtering the one-sided autocorrelation sequence
with a high-pass FIR filter, where L is the length of the filter. This high-pass filter
(differentiation), named the RAS filter, was used to suppress the effect of noise in the
autocorrelation sequence of the noisy signal. Therefore, this method is appropriate for noises
with slow variations in the autocorrelation domain, i.e. noises that can be considered relatively
stationary.
After calculating the one-sided autocorrelation sequence and differentiating both sides of
equation (6) with respect to m, the autocorrelation of noise is removed, i.e. the differentiation
of the noisy speech signal with respect to the frame index m in the autocorrelation domain is
equal to that of the clean speech signal. Obviously, this filtering will also have some slight
negative effects on the lower modulation frequencies of speech. However, these have been found to
be quite small (refer to Section 5 for RAS performance in clean speech conditions).
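A minimal sketch of the RAS filtering step, in the spirit of equation (7), might look as follows (the edge handling and function name are our own choices, not the paper's):

```python
import numpy as np

def ras_filter(r, L=2):
    """Differentiate each lag trajectory across the frame index m with a
    regression window of half-length L, as a high-pass FIR filter.
    Constant (stationary-noise) components cancel out.
    r has shape (frames M, lags K)."""
    M = r.shape[0]
    denom = 2.0 * sum(l * l for l in range(1, L + 1))
    out = np.zeros_like(r, dtype=float)
    for m in range(M):
        for l in range(1, L + 1):
            # clamp indices at the utterance edges
            out[m] += l * (r[min(m + l, M - 1)] - r[max(m - l, 0)])
    return out / denom
```

Applied to an autocorrelation sequence that is constant across frames (the stationary-noise case), every output frame is zero, which is exactly the noise-suppression property the method relies on.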
2.2.2. Autocorrelation Mel-frequency Cepstral Coefficients (AMFCC)
In this approach [10], the MFCC coefficients were extracted from the noisy signal autocorrelation
sequence after removing some of its lower-lag coefficients. These lower-lag coefficients were
shown to carry the highest noise influence for many noise types, including those with the least
correlation among frames. The lag threshold used was 3 msec, set by finding the first valley in
the absolute autocorrelation function computed over TIMIT speech frames.
As reported in reference [10], this method works well for the car and subway noises of the Aurora
2 task, but not for the babble and exhibition noises. The reason was believed to be the wider
autocorrelation functions of the latter noise types, whose components are spread out over
different lags. Therefore, the main reason for the limited success of AMFCC in noises such as
babble and exhibition is that the noise autocorrelation properties are more similar to those of
the speech signal, which makes their separation difficult.
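The lag-removal step itself is simple; a sketch (with the 3 ms threshold from [10] converted to samples, helper name ours) could be:

```python
import numpy as np

def remove_lower_lags(r, fs, threshold_ms=3.0):
    """AMFCC-style removal of the lower autocorrelation lags, which carry
    most of the noise energy for many noise types.
    r has shape (frames, lags); fs is the sampling rate in Hz."""
    d = int(round(threshold_ms * 1e-3 * fs))  # threshold in samples
    return r[:, d:]  # keep only lags >= d in every frame
```

At an 8 kHz sampling rate the 3 ms threshold corresponds to 24 samples, so the first 24 lags of every frame are discarded before feature extraction.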
2.2.3. Differentiation of Autocorrelation Sequence (DAS)
This algorithm combines the use of the enhanced autocorrelation sequence of the noisy speech
and the spectral peaks found from the autocorrelation sequence, as they are known to convey the
most important information of the speech signal [11].
In this method, in order to preserve the speech spectral peaks, spectral differentiation has been
used. With this differentiation, the flat parts of the spectrum are almost removed and each
spectral peak is split into two, one positive and one negative. The differential power spectrum of
the noisy signal was defined as

DiffS(m,k) = Σ_{l=−P}^{Q} al · S(m,k+l), 0 ≤ k ≤ K−1, (8)
where P and Q are the orders of the difference equation, al are real-valued coefficients and K is
the length of the FFT (on the positive frequency side) [16]. The differentiation mentioned in
equation (8) can be carried out in several ways, as discussed in reference [16]. The simple
difference had shown the best results and was therefore used in reference [11], i.e.

DiffS(m,k) = S(m,k) − S(m,k+1). (9)
The feature extraction procedure was carried out after high-pass filtering (as in equation (7))
and peak extraction (as in equation (9)). As explained earlier for RAS, this filtering can
suppress the effect of slowly varying noises on the speech signal. The spectral peaks were then
extracted through differentiation of the spectrum found from the filtered autocorrelation
sequence, leading to better suppression of the noise effect. Finally, an MFCC-like feature set was
extracted and used in the recognition experiments.
2.2.4. Spectral Peaks of Filtered Higher-lag Autocorrelation Sequence (SPFH)
This method was proposed to overcome the main drawback of AMFCC, i.e. its inability to deal
with noises that have autocorrelation components spread out over different lags [17].
In SPFH, after frame blocking and pre-emphasis of the noisy signal, the autocorrelation sequence
of the frame signal was obtained as in equation (3) and its lower lags were removed. An FIR
high-pass filter, similar to the RAS filter, was then applied to the signal autocorrelation
sequence to further
suppress the effect of noise, as in equation (7). Then, Hamming windowing and short-time
Fourier transform were carried out and the differential power spectrum of the filtered signal was
found using equation (9). Since the noise spectrum may often be considered flat in comparison to
the speech spectrum, the differentiation either reduces or removes these relatively flat parts of
the spectrum, leading to even further suppression of the effect of noise. The final stages
included applying the resultant magnitude of the differentiated autocorrelation-derived power
spectrum to a conventional mel-frequency filter-bank and passing the logarithm of the outputs to
a DCT block to extract a set of cepstral coefficients per frame.
In fact, the SPFH method tried to attenuate the effect of noise after preserving higher lags of
noisy autocorrelation sequence by high-pass filtering, as in equation (7), and preserving spectral
peaks, as in equation (9), i.e. similar to DAS.
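Combining the steps described above, the SPFH front-end up to the differential power spectrum could be sketched as follows; this is a simplified stand-in (a plain first difference across frames replaces the RAS filter, and all names are ours):

```python
import numpy as np

def spfh_spectrum(r_frames, fs, lag_ms=3.0):
    """Sketch of the SPFH front-end: drop lower autocorrelation lags,
    high-pass filter across frames, Hamming window, FFT, then the
    differential power spectrum. r_frames has shape (frames, lags)."""
    d = int(round(lag_ms * 1e-3 * fs))
    r = r_frames[:, d:]                       # remove lower lags (AMFCC step)
    r = np.diff(r, axis=0, prepend=r[:1])     # crude high-pass over frames
    r = r * np.hamming(r.shape[1])            # windowing
    power = np.abs(np.fft.rfft(r, axis=1)) ** 2
    return power[:, :-1] - power[:, 1:]       # differential power spectrum
```

The mel filter bank, logarithm and DCT steps that follow are the same as in a standard MFCC front-end and are omitted here.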
3. NOISE SUBTRACTION IN AUTOCORRELATION DOMAIN
3.1. Autocorrelation-based Noise Subtraction (ANS)
As an ideal assumption, we can consider the autocorrelation of noise to be a unit sample at the
origin and zero at all other lags. Under this assumption, the portion of the noisy speech
autocorrelation sequence that is far enough from the origin will have the same autocorrelation as
the clean speech signal. This ideal assumption is of course only true for white noise; for real
environmental noises, components at lags other than zero are also present.
Investigations showed that there exist some major autocorrelation components for these noises
concentrated around the origin. This was the reason for introducing AMFCC and SPFH methods
mentioned earlier. However, as these methods drop the lower lags of the autocorrelation sequence
of the noisy speech signal to suppress the effect of noise, they are not useful for the cases where
important components are seen in higher autocorrelation lags of the noise, i.e. above 20 to 25
samples. In such cases, the AMFCC approach not only fails to completely suppress the effect of
noise, but also removes some probably useful lower-lag portions of the autocorrelation sequence
of the speech signal. As an alternative to such methods, we follow a newer approach. Here, in
place of removing the lower-lag autocorrelation components of the noisy signal, we try to
estimate the noise autocorrelation sequence and subtract it from the noisy signal autocorrelation
sequence. This is conceptually similar to the well-known spectral subtraction, with the exception
that it is applied not to the magnitude spectrum, but to the autocorrelation sequence [17]. An
instant advantage is that there is no need to deal with the phase issue.
In reference [17], the average autocorrelation of a number of non-speech frames of the utterance
is used as an estimate of the noise autocorrelation sequence. We write this as

r̂vv(k) = (1/P) · Σ_{i=0}^{P−1} ryy(i,k), (10)

where P is the number of non-speech frames of the utterance used and r̂vv(k) is the noise
autocorrelation estimate. Therefore, we may write the estimate of the autocorrelation sequence of
the clean speech signal as

r̂xx(m,k) = ryy(m,k) − r̂vv(k). (11)
In order to estimate the noise autocorrelation in the ANS method, a voice activity detector (VAD)
or the initial silence of the speech utterance can be used. Note that procedures similar to many
other widely-used noise estimation methods could also be used here.
3.2. The Cross Correlation Term
Figure 1 displays the autocorrelation sequences for two examples of clean speech, noise and
noisy speech signals with the noises being babble and factory, extracted from the NATO RSG-10
corpus [18], as well as the sum of autocorrelation sequences of speech and noise. One should
expect the speech signal and noise, in most circumstances, to be completely uncorrelated.
However, in this case, according to Figure 1, the autocorrelation sequence of the noisy speech is
not equal to the sum of those of clean speech and noise. In order to be able to have a more
accurate estimate of the speech signal autocorrelation, one needs to consider some correlation
among speech and noise signals to compensate for this difference. It should be noted that this
difference is in fact due to the short-time nature of our analysis, as the simple form of additive
autocorrelation mentioned in equation (2) is only possible when an infinitely long signal is
considered in the analysis [19]. We have used the two following approaches in order to consider
the cross correlation term in autocorrelation calculations:
Figure 1. Sample autocorrelation sequences of the clean speech, noisy speech and noise, as well as
the sum of the autocorrelations of the clean speech signal and noise, with (a) babble noise and
(b) factory noise, at an SNR of 10 dB.
3.2.1. Kernel Method
Generally, assuming the speech signal and noise to be completely uncorrelated, we write the
autocorrelation of the noisy speech signal as the sum of the autocorrelations of the clean speech
signal and noise. If we also consider the above-mentioned correlation between the clean speech
signal and noise, the relationship between the autocorrelations mentioned in equation (6) should
change to:

ryy(m,k) = rxx(m,k) + rxv(m,k) + rvx(m,k) + rvv(k). (12)
3.2.2. Autocorrelation Averaging
We used autocorrelation averaging as an alternative way of reducing the observed correlation
effect between noise and the clean speech signal. We remind the reader that, as already mentioned,
this correlation might even be solely the result of autocorrelation analysis on finite-duration
signals. In reference [21], it was shown that a smoothing approach can help spectral subtraction
overcome the speech/noise correlation problems. The reason is that the probability density
function (pdf) of the cosine of the angle between the speech and noise vectors has been shown to
have a minimum at zero, while smoothing leads to a pdf with a maximum at zero and smaller
variances as larger numbers of frames take part in the smoothing¹. As a result, as will be further
explained later, ignoring the term that includes the speech/noise cross correlation, i.e. assuming
it to be zero, would be less harmful after smoothing. We define the average of the noisy
autocorrelation sequence as

r̄yy(m,k) = Σ_{i=0}^{T−1} bi · ryy(m−i,k), (21)

i.e. weighted averaging of the noisy speech autocorrelation over T frames, where bi is a weighting
parameter larger than 0 and less than or equal to 1.
------------------------------
¹ A more detailed discussion on this issue can be found in [20].
By replacing ryy(m−i,k) in equation (21) with the value found in equation (4), we have:

r̄yy(m,k) = Σ_{i=0}^{T−1} bi · [rxx(m−i,k) + rxv(m−i,k) + rvx(m−i,k) + rvv(m−i,k)]. (22)

If the variations in noise and speech could be assumed negligible during a period T, we can write

r̄yy(m,k) ≈ (Σ_{i=0}^{T−1} bi) · [rxx(m,k) + rxv(m,k) + rvx(m,k) + rvv(m,k)]. (23)

We will call this approach autocorrelation-based noise subtraction with smoothing (ANSS).
Details of setting of the length of averaging window in this approach will be discussed in the
parameter setting Section 4.3.
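As a hedged illustration, the frame-wise smoothing of the noisy autocorrelation could be sketched as below; the edge handling and the normalization by the sum of the weights are our own choices, not necessarily those of the paper:

```python
import numpy as np

def smoothed_autocorr(r_noisy, m, b):
    """Weighted average of the noisy autocorrelation over the current
    frame and the preceding frames; b holds the T weights b_i with
    0 < b_i <= 1. Frames before the start of the utterance reuse
    frame 0; normalization by sum(b) is an assumption of this sketch."""
    acc = np.zeros(r_noisy.shape[1])
    for i, bi in enumerate(b):
        acc += bi * r_noisy[max(m - i, 0)]
    return acc / sum(b)
```

With equal weights this reduces to a plain moving average over the last T frames of the noisy autocorrelation sequence.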
3.3. ANS versus Spectral Subtraction

Due to the conceptual similarity of ANS and spectral subtraction (SS), in this section we make a
comparison between the two methods. The first, and by far the most important, difference between
them is that the subtraction in SS takes place in the spectral domain, whereas for ANS the
subtraction is carried out in the autocorrelation domain (time domain). Note that in the
implementation of spectral subtraction reported in this section, the overestimation factor was
set equal to that used for ANS and the flooring parameter was set to 0.002.
Although traditional spectral subtraction suffers from a few problems that affect the quality of
enhanced speech, the most important source of distortion in this method is known to be the negative
values encountered during subtraction that should be mapped to a spectral floor [22]. This
nonlinear mapping causes an effect that is usually known as musical noise and is always
associated with the basic spectral subtraction method.
In ANS, as the subtraction is carried out in autocorrelation domain, negative and positive values
are not treated differently, and therefore, there is no need for flooring or other non-linear
mappings. In fact, problems associated with non-linearity are not encountered anymore and
inaccuracies in speech spectral estimates are only due to errors in noise autocorrelation estimation
and its associated problems.
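The flooring step that ANS avoids can be seen in a minimal sketch of basic power spectral subtraction (parameter names alpha and beta are ours; beta = 0.002 matches the flooring value quoted above):

```python
import numpy as np

def spectral_subtract(noisy_power, noise_power, alpha=1.0, beta=0.002):
    """Basic power spectral subtraction with flooring: negative
    differences are clipped to beta * noisy_power. This nonlinear
    mapping is the source of musical noise; ANS needs no such step,
    since autocorrelation values may legitimately be negative."""
    diff = noisy_power - alpha * noise_power
    return np.maximum(diff, beta * noisy_power)
```

Whenever the noise estimate exceeds the noisy power in a bin, the output is pinned to the floor, producing the isolated spectral peaks perceived as musical noise.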
Figure 4 displays the power spectra of a frame of an utterance of the word "one", uttered by a
female speaker and contaminated with train station noise at 0dB and 10dB SNRs. This utterance
is extracted from test set A of the Aurora 2 task. In this figure, the power spectra of the signal
after the application of ANS and spectral subtraction are shown. As can be seen, the power
spectrum extracted after the application of ANS to the noisy speech closely follows the peaks and
valleys of the clean spectrum, while the SS-treated spectrum deviates from it more noticeably.
The normalized average spectral errors of both methods are also shown in Table 1. As can be seen,
the Root Mean Square Error (RMSE) of ANS is much smaller than that of spectral subtraction.
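A normalized spectral error in the spirit of Table 1 can be computed as below; the exact normalization used in the paper is not given, so normalizing by the clean-spectrum RMS is our assumption:

```python
import numpy as np

def normalized_rmse(estimated, clean):
    """RMSE between an enhanced and the clean log power spectrum,
    normalized (by assumption) by the RMS of the clean spectrum."""
    rmse = np.sqrt(np.mean((estimated - clean) ** 2))
    return rmse / np.sqrt(np.mean(clean ** 2))
```

An identical pair of spectra gives 0, and an estimate that is twice the clean spectrum gives 1, so smaller values mean a closer spectral match.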
Figure 4. Log power spectrum of a speech frame of 'FAK_1B.08' utterance from test set A of Aurora 2 task
contaminated with subway noise at 0dB and 10dB SNRs in logarithmic scale.
Table 1. Normalized Average of Spectral Subtraction and ANS Spectrum Errors (RMSE Criteria) on Test
Set A of Aurora 2 Task
4. IMPLEMENTATION ISSUES IN PROPOSED ALGORITHMS
In this section we discuss a number of implementation issues regarding our proposed methods.
We also consider the overestimation parameter, which enables us to better estimate the noise
autocorrelation sequence.
4.1. Considering Cross Correlation
To consider the cross correlation term, we have implemented two different methods, as discussed
in Subsections 3.2.1 and 3.2.2. The procedure for feature extraction in our proposed methods is as
follows.
a) Frame Blocking and pre-emphasis.
b) Hamming windowing.
c) Calculation of unbiased autocorrelation sequence of noisy speech signal.
d) Estimation of the noise autocorrelation sequence in each utterance and subtraction of it
from the noisy speech autocorrelation sequence in each frame of the utterance (more
details of the parameter settings can be found in Section 4.3).
e) Kernel function computation (see Subsection 3.2.1) or autocorrelation averaging (see
Subsection 3.2.2).
f) Inserting cross correlation term in the estimation of the autocorrelation of the clean
speech.
g) Fast Fourier Transform (FFT) calculation.
h) Calculation of the logarithms of Mel-frequency filter bank outputs.
14. Signal & Image Processing : An International Journal (SIPIJ) Vol.8, No.1, February 2017
36
i) Application of DCT to the sequence resulting from the previous step.
j) Calculation of the feature vectors including 12 cepstral and a log-energy parameter and
their first and second order dynamic parameters.
In these algorithms, almost all of the steps are straightforward. Only steps e) and f), which relate
to the inclusion of the cross correlation term, are added to our implementation of ANS. The
accuracy of the cross correlation estimate is crucial at this stage. The results of our
implementations will be given in Section 5.
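The steps above can be sketched as follows. This is an illustrative outline in which all names are ours; frame blocking and pre-emphasis (step a) and the cross correlation estimate (steps e and f) are assumed to be handled elsewhere.

```python
import numpy as np

def extract_features(frames, noise_ac, mel_fb, dct_mat):
    """Sketch of steps b)-i) for one utterance.
    frames:  array of windowless speech frames (after step a)
    noise_ac: estimated noise autocorrelation sequence (step d input)
    mel_fb:  (23, n_bins) mel filter-bank matrix
    dct_mat: (12, 23) DCT matrix giving 12 cepstral parameters."""
    feats = []
    for frame in frames:
        w = frame * np.hamming(len(frame))                             # b)
        n = len(w)
        ac = np.correlate(w, w, 'full')[n - 1:] / np.arange(n, 0, -1)  # c) unbiased
        ac_clean = ac - noise_ac                                       # d) (plus e-f)
        # g) magnitude of the FFT of the autocorrelation approximates
        #    the power spectrum of the estimated clean signal
        spec = np.abs(np.fft.rfft(ac_clean))
        logmel = np.log(mel_fb @ spec + 1e-12)                         # h)
        feats.append(dct_mat @ logmel)                                 # i)
    return np.array(feats)
```

Step j) then appends the log-energy term and the first- and second-order dynamic parameters to each 12-dimensional cepstral vector.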
4.2. Considering Overestimation Parameter
Since our algorithm is applied to the autocorrelation of the noisy signal, the flooring parameter
used in spectral subtraction is not needed. The reason is that flooring in spectral subtraction is
usually needed to remove negative spectral values, which are not a problem in the
autocorrelation domain. As shown in Figure 5, the autocorrelation sequence of noise may
contain valleys and peaks whose lag locations and magnitudes vary from one frame to another.
Although smoothed to some extent, such spurious peaks and valleys may still show up in our
estimate of the noise autocorrelation sequence. When the noise autocorrelation sequence is
subtracted from that of the noisy speech, the valleys and peaks of the estimated noise
autocorrelation introduce corresponding artifacts into the estimated clean speech autocorrelation
sequence. In order to reduce these effects, we have used an overestimation parameter by
modifying the ANS equation to
where α ≥ 1 is the overestimation parameter. Note that when α = 1, equation (29) is identical to
the equation used for ANS. Having α > 1 attenuates the peaks of the estimated clean speech
autocorrelation, since it increases the last term of equation (29). Various values of α were tested
to find the best result on the Aurora 2 task.
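A plausible form of equation (29), written here in assumed notation since the symbols are not reproduced in this excerpt, is:

```latex
% Assumed notation: \hat{r}_y(m,k) is the noisy speech autocorrelation and
% \hat{r}_n(m,k) the estimated noise autocorrelation at frame m and lag k.
\hat{r}_x(m,k) \approx \hat{r}_y(m,k) - \alpha \, \hat{r}_n(m,k), \qquad \alpha \ge 1
```

With α = 1 this reduces to plain ANS subtraction, and α > 1 scales up the subtracted noise term, consistent with the peak attenuation described in the text.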
Figure 5. Autocorrelation sequences for 5 consecutive frames of factory noise.
In order to reduce the speech distortions caused by large values of α, we changed this
parameter with SNR [23]. The SNR was calculated frame by frame, as explained in Section 4.3.
Figure 6 shows the trend of change we used for the parameter α with SNR.
Clearly, with increasing SNR the value of α should decrease, and vice versa. The trend of this
change was set to linear, as shown in the figure, according to changes observed in system
recognition performance in practice [17]. We tested the proposed method both with and without
taking into account the signal/noise cross correlation. If we consider the cross correlation, as
explained in Section 3.2.1, together with the overestimation parameter, the following
relationship for clean speech signal estimation results.
Meanwhile, considering the cross correlation term as in Section 3.2.2, together with the
overestimation parameter, we obtain the following equation, which gives an approximate value
of the clean speech signal.
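The linear α-SNR schedule of Figure 6 can be sketched as below. The endpoint values and the SNR range are illustrative assumptions, since the exact numbers would have to be read from the figure; the function name is ours.

```python
def overestimation_alpha(snr_db, alpha_max=3.0, alpha_min=1.0,
                         snr_low=-5.0, snr_high=20.0):
    """Linear decrease of the overestimation parameter alpha with frame SNR,
    in the spirit of Figure 6: large alpha at low SNR, alpha -> 1 at high SNR.
    All endpoint values here are assumptions, not the paper's."""
    if snr_db <= snr_low:
        return alpha_max
    if snr_db >= snr_high:
        return alpha_min
    slope = (alpha_min - alpha_max) / (snr_high - snr_low)
    return alpha_max + slope * (snr_db - snr_low)
```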
Figure 6. Change in the parameter α with SNR on Aurora 2 task.
4.3. Parameter Settings
In our implementation of RAS, the filter length was set to L = 2, following reference [9]. The
duration for lower-lag elimination in the AMFCC method was set to 2.5 ms (20 samples at the
8 kHz sampling rate of the Aurora 2 task), as in reference [10]. The same duration was also used
in the SPFH implementation [16]. To estimate the noise autocorrelation sequence, in all our
experiments we used the 20 initial frames of each utterance, considering them to be non-speech
sections. As shown in reference [24], this number of frames resulted in the best recognition
rates on the Aurora 2 task.
In the implementation of ANSS, in order to obtain the best results, we tried different numbers of
frames (T in equations (21) to (26)) for averaging. Figure 7 shows the results. As depicted, the
grand average recognition rates on the three test sets of the Aurora 2 task were best when 3
frames were used in autocorrelation averaging; therefore, this number was used for noisy speech
autocorrelation averaging in our experiments. Regarding the weights bi, simple averaging was
carried out. In the implementations using the overestimation parameter, this parameter was
changed as a function of the SNR in each frame. An estimate of the SNR in each frame was
found as
Figure 7. Normalized recognition rates for test sets of Aurora 2 task versus the number of frames used in
noise autocorrelation sequence averaging and their grand average.
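The frame averaging used in ANSS can be sketched as follows. Whether the averaging window is centered or causal is not stated in this excerpt, so a centered window with equal weights bi (the simple averaging mentioned above) is assumed; the function name is ours.

```python
import numpy as np

def average_autocorr(ac_frames, T=3):
    """ANSS-style smoothing: average the noisy speech autocorrelation over T
    consecutive frames (T = 3 gave the best grand average on Aurora 2).
    ac_frames: (n_frames, n_lags) array; edge frames use the frames available."""
    ac = np.asarray(ac_frames, dtype=float)
    out = np.empty_like(ac)
    half = T // 2
    for m in range(len(ac)):
        lo, hi = max(0, m - half), min(len(ac), m + half + 1)
        out[m] = ac[lo:hi].mean(axis=0)   # equal weights b_i
    return out
```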
5. EXPERIMENTS
In this section, we will describe the data used and procedures followed in our experiments. Our
implementations include some of the previous methods for comparison purposes as well as our
proposed approaches.
5.1. Data
The experiments were carried out on the Aurora 2 task [25]. The features were computed using
25 ms frames with 10 ms frame shifts. The pre-emphasis coefficient was set to 0.97. For each
speech frame, a 23-channel mel-scale filter-bank was used. The feature vectors for the proposed
methods were composed of 12 cepstral parameters and a log-energy parameter, together with
their first- and second-order derivatives. All model creation, training and testing in our
experiments were carried out using the standard hidden Markov model toolkit (HTK) [26], with
16 states and 3 mixture components per state. The HMMs were trained in the clean condition,
i.e. with clean training data.
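The first- and second-order dynamic parameters mentioned above can be computed with the standard regression (delta) formula; the regression window of 2 is the common HTK default and is assumed here, as is the function name.

```python
import numpy as np

def add_deltas(static, window=2):
    """Append first- and second-order dynamic (delta) parameters to a
    (frames x 13) static matrix of 12 cepstra + log energy, giving 39 dims.
    Uses the standard regression formula with edge padding."""
    def deltas(x):
        pad = np.pad(x, ((window, window), (0, 0)), mode='edge')
        num = sum(t * (pad[window + t:len(x) + window + t]
                       - pad[window - t:len(x) + window - t])
                  for t in range(1, window + 1))
        return num / (2 * sum(t * t for t in range(1, window + 1)))
    d1 = deltas(static)        # first-order dynamics
    d2 = deltas(d1)            # second-order dynamics
    return np.hstack([static, d1, d2])
```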
5.2. Implementation Results using Cross Correlation Terms
The parameter settings are as described in Section 4.3. Figure 8 shows the ANS, Kernel and
ANSS recognition results on the Aurora 2 data. For comparison, the results of baseline MFCC,
RAS, AMFCC and MFCC-SS are also included. RAS and AMFCC were chosen as two of the
most successful autocorrelation-based methods. Note also that the parameters used in MFCC-SS
are the same as those used in the spectral subtraction implementation explained in Section 2.3.
While the results of ANS, Kernel and ANSS show considerable improvement over the
baseline MFCC in noisy conditions, ANSS shows superior performance in comparison to the
ANS and Kernel methods. In fact, ANSS performs quite well, outperforming standard MFCC
by a very large margin, especially at lower SNRs, with up to 35% absolute reduction in word
error rate. In comparison to ANS, which itself performs satisfactorily in noisy conditions, the
higher performance of ANSS is notable. An immediate conclusion is that including the effect of
noise-signal cross correlation in the autocorrelation-based noise subtraction method can further
extend its performance. This indicates the effectiveness of inserting the cross correlation term
into the autocorrelation estimation of the noisy speech signal.
5.3. Implementation Results of Applying Overestimation Parameter
Here we report the results of including the overestimation parameter α in the clean speech
autocorrelation estimation procedure. Figure 9 depicts our recognition results on the Aurora 2
task. The naming conventions for our methods are as before, with OEP added to indicate the
inclusion of the overestimation parameter. As can be seen, the application of overestimation has
improved system recognition performance in almost all cases, which indicates the potential of
the overestimation parameter for improving autocorrelation-based noise subtraction.
5.4. Comparison of the Discussed Methods
In order to reach an overall conclusion, we compare the performances of all the mentioned
methods on the specified task. Furthermore, as mentioned in reference [27], using the
normalized energy instead of the log energy, together with mean and variance normalization of
the cepstral parameters, can improve speech recognition performance in noisy conditions. We
have therefore also applied this technique, which further improved the recognition rate of our
best method, ANSS+OEP. Table 2 shows the average recognition rates of all these methods on
the Aurora 2 task. As is usual in Aurora 2 result calculations, the -5dB and clean results are not
included in the averaging. Furthermore, the percentage of relative improvement of each method
in comparison to the baseline MFCC is also given. We have included two other test results in
this table: MFCC enhanced with spectral subtraction (MFCC-SS), and mean subtraction,
variance normalization and ARMA filtering (MVA) [28]. The former shows the performance
improvement obtained by spectral subtraction, as a basic enhancement approach, on this task,
while the latter is a rather simple method known to perform among the best in robust speech
recognition. The implementation procedure was identical to that of our other tests. Also, for
comparison purposes, the table reports the results obtained by applying the ETSI Extended
Advanced Front-End [29] to the Aurora 2 corpus. This is a standard front-end that uses
sophisticated enhancement approaches to improve the quality of the extracted features.
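The MVA post-processing [28] mentioned above can be sketched as follows; the ARMA filter order of 2 is an assumption, and the function name is ours.

```python
import numpy as np

def mva(features, order=2):
    """Mean subtraction, variance normalization and ARMA filtering (MVA),
    applied per utterance along the time axis. The ARMA step replaces each
    frame by the average of the previous M outputs and the current..+M inputs."""
    # mean subtraction + variance normalization, per feature dimension
    x = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    y = np.copy(x)
    M = order
    for t in range(M, len(x) - M):
        y[t] = (y[t - M:t].sum(axis=0) + x[t:t + M + 1].sum(axis=0)) / (2 * M + 1)
    return y
```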
Figure 8. Average recognition rates of MFCC, RAS, AMFCC, ANS, Kernel, ANSS and MFCC-SS on
Aurora 2 task. (a) Test set A, (b) Test set B and (c) Test set C.
Figure 9. Average recognition rates of ANSS+OEP, Kernel+OEP, ANS+OEP and ANS approaches on
Aurora 2. (a) Test set A, (b) Test set B, and (c) Test set C.
As expected, improving the more advanced autocorrelation-domain methods, i.e. DAS, SPFH
and ANS, with our proposed techniques yielded better results than the more basic
autocorrelation-based methods, i.e. RAS and AMFCC. As can be seen, the combination of
ANSS and overestimation with energy and cepstral mean and variance normalization (EMVN)
outperformed all our other proposed methods in average overall performance on all three test
sets of Aurora 2. It is also worth mentioning that this performance is obtained with simple,
low-complexity computations, whereas ETSI-XAFE is a complicated algorithm with a large
computational overhead. Moreover, as will be shown in the appendix, the strongest advantage
of the proposed methods over ETSI-XAFE lies at very low SNRs (-5dB in this case), which is
not included in the figures reported in Table 2.
6. CONCLUSIONS
In this paper, we have raised the issue of using autocorrelation-based noise estimation and
subtraction, taking into account the cross correlation term error. Two different methods were
introduced for the insertion of the cross correlation term into the estimation of clean speech
autocorrelation sequence, namely Kernel and ANSS. The Kernel method inserts the cross
correlation term using a kernel function whereas ANSS considers the cross correlation term by
averaging on a few frames. Both approaches were tested on Aurora 2 task and proved to be useful
in further improving the ANS results. The overestimation parameter, an important parameter
wherever autocorrelation sequence estimation is concerned, was also taken into account.
Practical experiments indicated that even better recognition performance could be obtained
when the overestimation parameter was introduced into the ANS, Kernel and ANSS methods.
According to these results, although all the methods performed better when implemented in
conjunction with the overestimation parameter, ANSS with the overestimation parameter
(ANSS+OEP) performed best among them, and its combination with energy and cepstral mean
and variance normalization performed even better than ETSI-XAFE. Altogether, a major result
is that features extracted from the autocorrelation sequence of the speech signal perform rather
well in the presence of noise and in so-called mismatched conditions.
Table 2. Comparison of Average Recognition Rates and Percentage of Improvement in Comparison to
MFCC for Various Feature Types on Three Test Sets of Aurora 2 Task
REFERENCES
[1] Boll, S., (1979), “Suppression of acoustic noise in speech using spectral subtraction”, IEEE Trans. on
Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 2, pp 113-120.
[2] Evans, N. W. D., Mason, J. S. D., Liu, W. M. & Fauve, B., (2006), “An assessment on the
fundamental limitations of spectral subtraction”, Proceedings of ICASSP.
[3] McGinn, D.P. & Johnson, D.H., (1989), “Estimation of all-pole model parameters from noise-
corrupted sequence”, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 3,
pp 433-436.
[4] Jalil, M., Butt, F. A. & Malik, A., (2013), “Short-time energy, magnitude, zero crossing rate and
autocorrelation measurement for discriminating voiced and unvoiced segments of speech signals”,
Proceeding of TAEECE, pp 208-212.
[5] Amado, R.-G. & Filho, J.-V., (2008), “Pitch detection algorithms based on zero-cross rate and
autocorrelation function for musical notes”, Proceeding of ICALIP, pp 449 – 454.
[6] Bansal, P., Dev, A. & Bala Jain, S., (2009), “Role of Spectral Peaks in Autocorrelation Domain for
Robust Speech Recognition”, Journal of Computing and Information Technology (CIT), Vol. 17, No. 3,
pp 295–303.
[7] Mansour, D. & Juang, B.-H., (1989), “The short-time modified coherence representation and noisy
speech recognition”, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 6,
pp 795-804.
[8] Hernando, J. & Nadeu, C., (1997), “Linear prediction of the one-sided autocorrelation sequence for
noisy speech recognition”, IEEE Transactions on Speech and Audio Processing, Vol. 5, No. 1, pp 80-
84.
[9] Yuo K.-H. & Wang, H.-C., (1999), “Robust features for noisy speech recognition based on temporal
trajectory filtering of short-time autocorrelation sequences”, Speech Communication, Vol. 28, pp 13-
24.
[10] Shannon, B. J. & Paliwal, K. K., (2006), “Feature extraction from higher-lag autocorrelation
coefficients for robust speech recognition”, Speech Communication, Vol. 48, pp 1458–1485.
[11] Farahani, G., Ahadi, S.M. & Homayounpour, M.M., (2007), “Features based on filtering and spectral
peaks in autocorrelation domain for robust speech recognition”, Computer Speech and Language,
Vol. 21, pp 187-205.
[12] Farahani, G., Ahadi, S.M. & Homayounpour, M.M., (2006), “Use of spectral peaks in autocorrelation
and group delay domains for robust speech recognition”, Proceeding of ICASSP.
[13] Farahani, G., Ahadi, S.M. & Homayounpour, M.M., (2006), “Robust Feature Extraction based on
Spectral Peaks of Group Delay and Autocorrelation Function and Phase Domain Analysis”,
Proceeding of ICSLP.
[14] Ikbal, S., Misra, H. & Bourlard, H., (2003), “Phase autocorrelation (PAC) derived robust speech
features”, Proceeding of ICASSP, pp 133-136.
[15] Hu, Y., Bhatnagar, M. & Loizou, P., (2001), “A Cross-correlation Technique for Enhancing Speech
Corrupted with Correlated Noise”, Proceeding of ICASSP.
[16] Farahani, G., Ahadi, S.M. & Homayounpour, M.M., (2006), “Robust feature extraction using spectral
peaks of the filtered higher lag autocorrelation sequence of the speech signal”, Proceeding of ISSPIT.
[17] Farahani, G., Ahadi, S.M. & Homayounpour, M.M., (2006), “Robust Feature Extraction of Speech
via Noise Reduction in Autocorrelation Domain”, Proceeding of IWMRCS.
[18] SPIB, (1995), SPIB noise data. Available from: <http://spib.rice.edu/spib/select_noise.html>.
[19] Lu, Y. & Loizou, P.C., (2008), “A geometric approach to spectral subtraction”, Speech
Communication, Vol. 50, pp 453–466.
[20] Onoe, K., Segi, H., Kobayakawa, T., Sato, S., Imai, T. & Ando, A., (2002), “Filter Bank Subtraction
for Robust Speech Recognition”, Proceeding of ICSLP.
[21] Kitaoka, N. & Nakagawa, S., (2002), “Evaluation of spectral subtraction with smoothing of time
direction on the AURORA 2 task”, Proceeding of ICSLP, pp 477-480.
[22] Vaseghi, S.V., (2006), Advanced Digital Signal Processing and Noise Reduction, 3rd Edition, John
Wiley & Sons Ltd.
[23] Kamath, S. D. & Loizou, P. C., (2002), “A Multi-Band Spectral Subtraction Method for Enhancing
Speech Corrupted by Colored Noise”, Proceeding of ICASSP.
[24] Farahani, G., Ahadi, S.M. & Homayounpour, M.M., (2007), “Improved Autocorrelation-based Noise
Robust Speech Recognition Using Kernel-Based Cross Correlation and Overestimation Parameters”,
Proceeding of EUSIPCO.
[25] Hirsch, H.G. & Pearce, D., (2000), “The Aurora experimental framework for the performance
evaluation of speech recognition systems under noisy conditions”, Proceeding of ISCA ITRW ASR.
[26] HTK, (2002), The hidden Markov model toolkit available from: <http://htk.eng.cam.ac.uk>.
[27] Ahadi, S. M., Sheikhzadeh, H., Brennan, R. L. & Freeman, G. H., (2004), “An Energy Scheme for
Improved Robustness in Speech Recognition”, Proceeding of ICSLP.
[28] Chen, C.-P, Bilmes, J. & Kirchhoff, K., (2002), “Low-Resource Noise-Robust Feature Post-
Processing on Aurora 2.0”, Proceeding of ICSLP, pp 2445-2448.
[29] ETSI-XAFE, (2003), ETSI Standard, Extended advanced front-end feature extraction algorithm -
ETSI ES 202 212 V1.1.1.
AUTHOR
Gholamreza Farahani received his BSc degree in electrical engineering from Sharif
University of Technology, Tehran, Iran, in 1998 and MSc and PhD degrees in
electrical engineering from Amirkabir University of Technology (Polytechnic),
Tehran, Iran in 2000 and 2006 respectively. Currently, he is an assistant professor in
the Institute of Electrical and Information Technology, Iranian Research Organization
for Science and Technology (IROST), Iran. His research interest is signal processing,
especially speech processing.