Speech is the most efficient and popular means of human communication, and it is produced as a sequence of phonemes. Phoneme recognition is the first step performed by an automatic speech recognition system. State-of-the-art recognizers use mel-frequency cepstral coefficient (MFCC) features derived through short-time analysis, for which recognition accuracy is limited. Here, instead, broad phoneme classification is achieved using features derived directly from the speech at the signal level itself. The broad phoneme classes are vowels, nasals, fricatives, stops, approximants, and silence. The features found useful for broad phoneme classification are the voiced/unvoiced decision, zero-crossing rate (ZCR), short-time energy, most dominant frequency, energy in the most dominant frequency, spectral flatness measure, and the first three formants. Features derived from short-time frames of training speech are used to train a multilayer feedforward neural network classifier with manually marked class labels as output, and classification accuracy is then tested. The broad phoneme classifier is subsequently used for broad syllable structure prediction, which is useful for applications such as automatic speech recognition and automatic language identification.
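Two of the signal-level features listed above, zero-crossing rate and short-time energy, can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the frame length and hop size are assumptions.

```python
# Minimal sketch: zero-crossing rate (ZCR) and short-time energy computed
# over fixed-length frames of a waveform. Frame length and hop (in samples)
# are illustrative assumptions.

def frame_features(samples, frame_len=400, hop=160):
    """Return (zcr, energy) per frame for a list of samples."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # ZCR: fraction of adjacent sample pairs with a sign change
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ) / (frame_len - 1)
        # Short-time energy: mean squared amplitude of the frame
        energy = sum(x * x for x in frame) / frame_len
        feats.append((zcr, energy))
    return feats

# Fricative-like noise has a high ZCR; a vowel-like low tone has a low one.
import math
noisy = [(-1) ** n * 0.5 for n in range(800)]          # alternating signs
tonal = [math.sin(2 * math.pi * 50 * n / 8000) for n in range(800)]
assert frame_features(noisy)[0][0] > frame_features(tonal)[0][0]
```

The closing comparison illustrates why these two cheap features already separate some broad classes: unvoiced fricatives are noise-like (high ZCR, moderate energy) while vowels are tonal (low ZCR, high energy).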
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or operating weapons. Dynamic time warping (DTW) is widely used for aligning two multidimensional sequences; it finds an optimal match between the given sequences. The distance between the aligned sequences should be smaller than that between the unaligned sequences, and the improvement in the alignment can be estimated from the corresponding distances. This technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the improvement in alignment obtained for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was segmented manually and aligned at the sentence and phoneme levels. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations showed more than a 20% reduction in the average Mahalanobis distances.
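The DTW alignment described above can be sketched as a standard dynamic program. This is an illustrative sketch: the Euclidean local cost is an assumption, whereas the paper computes Mahalanobis distances on HNM parameters.

```python
# Illustrative sketch of dynamic time warping (DTW): a dynamic-programming
# alignment of two sequences of feature vectors. Euclidean local cost is
# an assumption; the paper uses Mahalanobis distances on HNM parameters.
import math

def dtw_distance(a, b):
    """DTW cost between two sequences of equal-dimension vectors."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])        # local distance
            # extend the cheapest of: match, insertion, deletion
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

x = [(0.0,), (1.0,), (2.0,)]
y = [(0.0,), (0.0,), (1.0,), (2.0,)]   # same shape, stretched in time
assert dtw_distance(x, y) == 0.0       # DTW absorbs the timing difference
```

The final assertion shows the property the abstract exploits: two utterances with the same spectral trajectory but different timing align with near-zero cost, so the residual distance after DTW measures genuine speaker or content differences.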
ANALYSIS OF SPEECH UNDER STRESS USING LINEAR TECHNIQUES AND NON-LINEAR TECHNI... (cscpconf)
Analysis of speech for recognition of stress is important for identifying the emotional state of a person. This can be done using linear techniques, which extract features from speech using parameters such as pitch, vocal tract spectrum, formant frequencies, duration, and MFCC. TEO-CB-Auto-Env is a non-linear method of feature extraction. The analysis is performed using the TU-Berlin (Technical University of Berlin) German database. Here, emotion recognition is carried out for different emotions: neutral, happy, disgust, sad, boredom, and anger. Emotion recognition is used in lie detectors, database access systems, and in the military for identifying soldiers' emotions during combat.
Wavelet Based Noise Robust Features for Speaker Recognition (CSCJournals)
Extraction and selection of the best parametric representation of the acoustic signal is the most important task in designing any speaker recognition system. A wide range of possibilities exists for parametrically representing the speech signal, such as Linear Prediction Coding (LPC), Mel Frequency Cepstrum Coefficients (MFCC), and others. MFCC are currently the most popular choice for speaker recognition systems, though one of their shortcomings is that the signal is assumed to be stationary within a given time frame; they are therefore unable to analyze non-stationary signals and are not well suited to noisy speech. To overcome this problem, several researchers have used various AM-FM modulation/demodulation techniques for extracting features from the speech signal, while other approaches use wavelet filterbanks for feature extraction. In this paper, a technique that combines the above-mentioned approaches is proposed: features are extracted from the envelope of the signal, which is then passed through a wavelet filterbank. It is found that the proposed method outperforms the existing feature extraction techniques.
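The envelope-then-filterbank idea above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's design: a repeated one-level Haar decomposition stands in for the wavelet filterbank, and the absolute value stands in for the AM envelope.

```python
# Toy sketch (assumptions, not the paper's method): compute a crude AM
# envelope, then split it into subbands with repeated one-level Haar
# analysis steps, keeping one energy per subband as a feature.

def haar_step(signal):
    """One Haar analysis level: (approximation, detail) half-band outputs."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def subband_energies(signal, levels=3):
    """Recursively split the low band; return one energy per subband."""
    envelope = [abs(x) for x in signal]          # crude AM envelope
    energies = []
    band = envelope
    for _ in range(levels):
        band, detail = haar_step(band)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in band))    # final approximation band
    return energies
```

A real system would use a proper wavelet family and a demodulation-based envelope, but the feature vector has the same shape: one energy per subband of the envelope.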
Isolated words recognition using MFCC, LPC and neural network (eSAT Journals)
Abstract: Automatic speech recognition is an important topic in speech processing. This paper presents the use of an Artificial Neural Network (ANN) for isolated word recognition. Pre-processing is performed, and voiced speech is detected based on energy and zero-crossing rate (ZCR). The proposed approach uses Mel Frequency Cepstral Coefficients (MFCC) as well as the combined features of MFCC and Linear Predictive Coding (LPC). Back-propagation is used as the classifier. The recognition accuracy increases when the combined LPC and MFCC features are used, compared with the MFCC-only approach, with a neural network as the classifier. Keywords: Pre-processing, Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Artificial Neural Network (ANN).
VOWEL PHONEME RECOGNITION BASED ON AVERAGE ENERGY INFORMATION IN THE ZEROCROS... (ijistjournal)
The speech signal is modelled using the average energy of the signal in the zero-crossing intervals. The variation of these energies across the zero-crossing intervals is studied, and the distribution of this parameter throughout the signal is evaluated. It is observed that the distribution patterns are similar for repeated utterances of the same vowel and vary from vowel to vowel. The credibility of the proposed parameter is verified on five Malayalam (one of the most widely spoken Indian languages) vowels using a multilayer feedforward artificial neural network based recognition system. The performance of the system on speech corrupted by additive white Gaussian noise is also studied for different SNR levels. The experimental results show that the average energy information in the zero-crossing intervals and its distribution can be effectively utilised for vowel phoneme classification and recognition.
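The parameter described above can be sketched as follows; the exact definition is an assumption on our part, since the abstract does not spell it out.

```python
# Sketch (assumed definition): split the signal at its zero crossings and
# compute the average energy of the samples inside each zero-crossing
# interval.

def zc_interval_energies(samples):
    """Average energy per zero-crossing interval of the signal."""
    energies, start = [], 0
    for i in range(1, len(samples)):
        if (samples[i - 1] >= 0) != (samples[i] >= 0):   # sign change
            seg = samples[start:i]
            energies.append(sum(x * x for x in seg) / len(seg))
            start = i
    seg = samples[start:]                                # trailing interval
    if seg:
        energies.append(sum(x * x for x in seg) / len(seg))
    return energies
```

The distribution of these interval energies over an utterance is then the vowel-discriminating feature the abstract evaluates.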
International Journal of Engineering Research and Applications (IJERA) aims to cover the latest outstanding developments in all fields of engineering, technology, and science. IJERA is a team of researchers, not a publication service or a private publisher running journals for monetary benefit; it is an association of scientists and academics focused on supporting authors who want to publish their work. The articles published in the journal can be accessed online, and all articles are archived for real-time access. The journal primarily aims to bring out the research talent and works of scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. It covers scientific research in a broad sense rather than a niche area, enabling researchers from various verticals to publish their papers, and it aims to provide a platform for publication in a shorter time so that researchers can continue their work. All published articles are freely available to scientific researchers in government agencies, to educators, and to the general public.
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp... (CSCJournals)
In this paper a new thresholding based speech enhancement approach is presented, where the threshold is statistically determined by employing the Teager energy operation on the Wavelet Packet (WP) coefficients of noisy speech. The threshold thus obtained is applied on the WP coefficients of the noisy speech by using a hard thresholding function in order to obtain an enhanced speech. Detailed simulations are carried out in the presence of white, car, pink, and babble noises to evaluate the performance of the proposed method. Standard objective measures, spectrogram representations and subjective listening tests show that the proposed method outperforms the existing state-of-the-art thresholding based speech enhancement approaches for noisy speech from high to low levels of SNR.
Improvement of minimum tracking in Minimum Statistics noise estimation method (CSCJournals)
Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. In this paper, we propose a new method for minimum tracking in the Minimum Statistics (MS) noise estimation method, intended for highly non-stationary noise environments. Its benefit was confirmed by formal listening tests, which indicated that the proposed noise estimation algorithm, when integrated into speech enhancement, was preferred over other noise estimation algorithms.
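The minimum-tracking idea behind Minimum Statistics can be sketched as follows. This is an illustrative simplification of MS for a single frequency bin; the window length and smoothing factor are assumptions, and the real algorithm adds bias compensation and adaptive smoothing.

```python
# Simplified sketch of Minimum Statistics minimum tracking: the noise power
# in a frequency bin is estimated as the minimum of the recursively smoothed
# periodogram over a sliding window of recent frames.

def ms_noise_track(frame_powers, window=8, alpha=0.85):
    """Per-frame noise estimate for one frequency bin."""
    smoothed, history, estimates = 0.0, [], []
    for p in frame_powers:
        smoothed = alpha * smoothed + (1 - alpha) * p   # recursive smoothing
        history.append(smoothed)
        if len(history) > window:
            history.pop(0)
        estimates.append(min(history))                  # sliding minimum
    return estimates
```

The key property: a short speech burst raises the smoothed power but not the sliding minimum, so the estimate stays near the noise floor without needing a voice activity detector.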
This paper describes a voice-morphing concept in which the voice of any person is converted into the pre-analyzed, pre-recorded voice of an animal. As the user produces an utterance, the pitch, timbre, vibrato, and articulation can be modified to resemble those of the pre-recorded and pre-analyzed animal voice. The technique is based on SMS. Using this concept, many entertaining applications can be developed for mobile devices, personal computers, and similar platforms.
Voice recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
This document describes how to build a simple, yet complete and representative automatic speaker recognition system. Such a speaker recognition system has potential in many security applications. For example, users have to speak a PIN (Personal Identification Number) in order to gain access to the laboratory door, or users have to speak their credit card number over the telephone line to verify their identity. By checking the voice characteristics of the input utterance, using an automatic speaker recognition system similar to the one that we will describe, the system is able to add an extra level of security.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSCJournals)
An automatic speaker recognition system recognizes an unknown speaker among several reference speakers by making use of speaker-specific information in their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique that performs gender recognition before speaker recognition so as to improve the accuracy and speed of the baseline speaker recognition system. Based on experiments conducted on the MIT database, we demonstrate that the proposed system improves accuracy over the baseline by approximately 2% while reducing computation time by more than 30%.
A Novel Uncertainty Parameter SR (Signal to Residual Spectrum Ratio) Evalua... (sipij)
Hearing-impaired people commonly use hearing aids that implement speech enhancement algorithms. Speech estimation and noise estimation are the components of a single-channel speech enhancement system, and the central task of any speech enhancement algorithm is estimating the noise power spectrum in non-stationary environments. A VAD (Voice Activity Detector) is used to identify speech pauses, and noise is estimated only during these pauses. The MMSE (Minimum Mean Square Error) speech enhancement algorithm does not enhance intelligibility, quality, and listener fatigue, which are the perceptual aspects of speech. A novel evaluation approach, SR (Signal to Residual spectrum ratio), based on an uncertainty parameter, is introduced to control distortions for the benefit of hearing-impaired people in non-stationary environments. Noise is estimated and updated by dividing the original signal into pure speech, quasi-speech, and non-speech frames based on multiple threshold conditions. Different values of SR and LLR (log-likelihood ratio) indicate the amount of attenuation and amplification distortion. The proposed method is compared with the WAT (Weighted Average Technique) and the MMSE method using the SR and LLR parameters, in terms of segmental SNR and LLR.
A REVIEW OF LPC METHODS FOR ENHANCEMENT OF SPEECH SIGNALS (ijiert bestjournal)
This paper presents a review of LPC methods for the enhancement of speech signals. The purpose of every speech enhancement method is to improve the quality of the speech signal by minimizing background noise. This paper comments in particular on Power Spectral Subtraction, Multiband Spectral Subtraction, the Non-Linear Spectral Subtraction method, the MMSE Spectral Subtraction method, and Spectral Subtraction based on perceptual properties. Since spectral subtraction produces residual noise and musical noise, the Linear Predictive Analysis method is used to enhance the speech.
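The basic spectral subtraction operation the review discusses can be sketched per frame. The over-subtraction factor and spectral floor used here are illustrative assumptions, not values from the review.

```python
# Sketch of power spectral subtraction for one frame of magnitude-spectrum
# values. The over-subtraction factor `alpha` and spectral `floor` are
# illustrative assumptions.

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, floor=0.02):
    """Subtract the noise power spectrum; clamp to a spectral floor."""
    enhanced = []
    for y, d in zip(noisy_mag, noise_mag):
        power = y * y - alpha * d * d          # power-domain subtraction
        floor_power = (floor * y) ** 2         # keep a small residual floor
        enhanced.append(max(power, floor_power) ** 0.5)
    return enhanced
```

The clamping step is what limits the musical-noise artifact mentioned above: bins whose subtracted power goes negative are held at a small fraction of the input instead of fluctuating around zero.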
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction (IOSRJVSP)
This paper aims to reduce the background noise introduced into a speech signal during capture, storage, transmission, and processing, using a spectral subtraction algorithm. To account for the fact that colored noise corrupts the speech signal non-uniformly over different frequency bands, the Multi-Band Spectral Subtraction (MBSS) approach is exploited, in which the amount of noise subtracted from the noisy speech signal is decided by a weighting factor. The choice of optimal weights determines the performance of the speech enhancement system. In this paper, the weights are decided based on the SFM (Spectral Flatness Measure) rather than the conventional SNR (Signal to Noise Ratio) based rule, since the SFM provides a true distinction between the speech signal and the noise signal. Spectrograms and Mean Opinion Scores show that speech enhanced by the proposed SFM-based MBSS has better perceptual quality and improved intelligibility than the existing SNR-based MBSS.
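The spectral flatness measure used above for band weighting has a compact standard definition, sketched here: the ratio of the geometric mean to the arithmetic mean of the power spectrum.

```python
# Sketch of the spectral flatness measure (SFM): geometric mean over
# arithmetic mean of the power spectrum. Near 1 for noise-like (flat)
# spectra, near 0 for peaky, voiced spectra.
import math

def spectral_flatness(power_spectrum):
    n = len(power_spectrum)
    geo = math.exp(sum(math.log(p) for p in power_spectrum) / n)
    arith = sum(power_spectrum) / n
    return geo / arith
```

This is why SFM can separate speech from noise where a raw SNR rule cannot: a flat band (SFM near 1) is likely noise-dominated and can be subtracted aggressively, while a peaky band (SFM near 0) likely carries voiced speech.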
Speech signal analysis for linear filter banks of different orders (eSAT Journals)
Abstract: In speech signal processing, the use of filter banks is very important. The critical requirement is that the sum of all the frequency responses of the band-pass filters of the filter bank, i.e., the composite frequency response, be flat with linear phase. This paper deals with the design and implementation of a filter bank based on linear-phase FIR digital filters with a flat composite frequency response. The design is based on special properties of FIR filters, by which an excellent frequency response can be achieved. Keywords: Speech signal, FIR filter, composite frequency response.
Single Channel Speech Enhancement using Wiener Filter and Compressive Sensing (IJECEIAES)
Speech enhancement algorithms are used to overcome multiple limiting factors in recent applications such as mobile phones and communication channels. The challenge is the trade-off between noise reduction and signal distortion in corrupted speech. We used a modified Wiener filter and compressive sensing (CS) to investigate and evaluate the improvement in speech quality. The new method adapts the noise estimation and the Wiener filter gain function so as to increase the weight of the amplitude spectrum and better preserve the signals of interest. CS is then applied using the gradient projection for sparse reconstruction (GPSR) technique to empirically investigate the interacting effects of the corrupting noise and to obtain better perceptual quality, reducing listener fatigue. The proposed algorithm outperforms other conventional algorithms in objective assessment tests under various noise types at SNRs of 0, 5, 10, and 15 dB. The proposed algorithm therefore significantly improves speech quality and achieves better noise reduction than other conventional algorithms.
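The classical Wiener gain function that the modified filter above builds on can be sketched per frequency bin: G = xi / (1 + xi), where xi is the a priori SNR. Estimating xi by simple power subtraction is an illustrative assumption; the paper's adapted estimator differs.

```python
# Sketch of the classical Wiener filter gain, per frequency bin:
# G = xi / (1 + xi), with the a priori SNR xi estimated here by a crude
# power-subtraction rule (an illustrative assumption).

def wiener_gains(noisy_power, noise_power):
    """Per-bin Wiener gains from noisy-speech and noise power spectra."""
    gains = []
    for y, d in zip(noisy_power, noise_power):
        xi = max(y - d, 0.0) / d          # crude a priori SNR estimate
        gains.append(xi / (1.0 + xi))     # Wiener gain in [0, 1)
    return gains
```

Bins well above the noise floor keep a gain near 1, while bins at or below it are suppressed toward 0; modifying how xi is estimated is where methods like the one above differ.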
Comparative performance analysis of channel normalization techniques (eSAT Journals)
Abstract: A major part of the interaction between humans takes place via speech communication. The speech signal carries both useful and unwanted information, and processing such signals involves enhancing the useful information. The intelligibility of speech signals is significantly reduced by the presence of unwanted information such as noise. Channel normalization algorithms suppress the additive noise introduced into speech signals by the transmission channel or by the recording environment. Enhancing the quality and intelligibility of speech signals improves the performance of speech systems such as automatic speech recognition (ASR), voice communication, and hearing aids, to name a few. Based on experimental results, a comparative analysis of channel normalization techniques is presented in this paper to find the most suitable algorithm for enhancing speech signals. Keywords: Cepstral Mean Normalization, Spectral Subtraction, Wiener filter, Signal to Noise Ratio.
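Cepstral Mean Normalization, the first technique in the keyword list above, can be sketched in a few lines: subtract the utterance-level mean from each cepstral coefficient, which cancels a stationary convolutional channel (constant in the cepstral domain).

```python
# Sketch of Cepstral Mean Normalization (CMN): subtract the per-coefficient
# mean over the utterance from every cepstral frame.

def cepstral_mean_normalize(frames):
    """frames: list of equal-length cepstral vectors; returns CMN'd copies."""
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / len(frames) for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]
```

Because a fixed channel adds the same offset to every cepstral frame, two recordings of the same speech through different channels become identical after CMN, which is exactly the normalization property being compared in this paper.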
Recently, many studies have drawn inspiration from ecological phenomena to develop optimization techniques. One such algorithm is motivated by a common phenomenon in agriculture: the colonization of invasive weeds. In this paper, a modified invasive weed optimization (IWO) algorithm is presented for multiobjective flexible job shop scheduling problems (FJSSPs), with the criteria of minimizing the maximum completion time (makespan), the total workload of the machines, and the workload of the critical machine. IWO is a bio-inspired metaheuristic that mimics the ecological behaviour of weeds in colonizing and finding suitable places for growth and reproduction. Because IWO was developed for continuous optimization problems, the Smallest Position Value (SPV) heuristic rule is used to convert the continuous position values into discrete job sequences. The computational experiments show that the proposed algorithm is highly competitive with the state-of-the-art methods in the literature, as it is able to find the optimal and best-known solutions for the instances studied.
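The Smallest Position Value rule mentioned above is compact enough to sketch directly: a candidate's continuous position vector is decoded into a discrete job sequence by ranking its components in ascending order.

```python
# Sketch of the Smallest Position Value (SPV) rule: rank the components of
# a continuous position vector; the job with the smallest value is
# scheduled first, the next smallest second, and so on.

def spv_decode(position):
    """Map a continuous position vector to a job permutation (0-indexed)."""
    return sorted(range(len(position)), key=lambda j: position[j])
```

This mapping lets a continuous metaheuristic like IWO search in real-valued space while every candidate still decodes to a valid permutation of jobs.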
In this study, a controller for a three-tank multi-loop system is designed using the Coefficient Diagram Method (CDM), one of the polynomial methods in control design. The controller designed by the CDM technique is based on the coefficients of the characteristic polynomial of the closed-loop system, chosen according to convenient performance measures such as the equivalent time constant, stability indices, and stability limit. A controller is designed for the three-tank process using the CDM technique; the simulation results show that the proposed control strategies give good set-point tracking and better response capability.
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...CSCJournals
In this paper a new thresholding based speech enhancement approach is presented, where the threshold is statistically determined by employing the Teager energy operation on the Wavelet Packet (WP) coefficients of noisy speech. The threshold thus obtained is applied on the WP coefficients of the noisy speech by using a hard thresholding function in order to obtain an enhanced speech. Detailed simulations are carried out in the presence of white, car, pink, and babble noises to evaluate the performance of the proposed method. Standard objective measures, spectrogram representations and subjective listening tests show that the proposed method outperforms the existing state-of-the-art thresholding based speech enhancement approaches for noisy speech from high to low levels of SNR.
Improvement of minimum tracking in Minimum Statistics noise estimation methodCSCJournals
Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. In this paper we propose a new method for minimum tracking in Minimum Statistics (MS) noise estimation method. This noise estimation algorithm is proposed for highly nonstationary noise environments. This was confirmed with formal listening tests which indicated that the proposed noise estimation algorithm when integrated in speech enhancement was preferred over other noise estimation algorithms.
This paper describe the morphing concept in which we convert the voice of any person into pre -analyzed or pre-recorded voice of any animals.As the user generate a pre-established voice, his pitch, timbre, vibrato and articulation can be modified to resemble those of a pre-recorded and pre-analyzed voice of animal. This technique is based on SMS. Thus using this concept we can develop many funny application and we can used this type of application in mobile device, personal computer etc. for enjoying the sometime of period.
Voice recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
This document describes how to build a simple, yet complete and representative automatic speaker recognition system. Such a speaker recognition system has potential in many security applications. For example, users have to speak a PIN (Personal Identification Number) in order to gain access to the laboratory door, or users have to speak their credit card number over the telephone line to verify their identity. By checking the voice characteristics of the input utterance, using an automatic speaker recognition system similar to the one that we will describe, the system is able to add an extra level of security.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition TechniqueCSCJournals
Automatic speaker recognition system is used to recognize an unknown speaker among several reference speakers by making use of speaker-specific information from their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition. Our baseline speaker recognition system accuracy, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique by performing gender recognition before speaker recognition so as to improve the accuracy/timeliness of our baseline speaker recognition system. Based on the experiments conducted on the MIT database, we demonstrate that our proposed system improves the accuracy over the baseline system by approximately 2%, while reducing the computational time by more than 30%.
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...sipij
Hearing-impaired people usually use hearing aids that implement speech enhancement algorithms. A single-channel speech enhancement system comprises two components: estimation of the speech and estimation of the noise. The main challenge for any speech enhancement algorithm is estimating the noise power spectrum in nonstationary environments. A voice activity detector (VAD) is used to identify speech pauses, and the noise is estimated only during these pauses. The minimum mean square error (MMSE) speech enhancement algorithm does not improve intelligibility, quality, or listener fatigue, which are the perceptual aspects of speech. A novel evaluation approach, SR (signal-to-residual spectrum ratio), based on an uncertainty parameter, is introduced to control distortions for the benefit of hearing-impaired people in nonstationary environments. The noise is estimated and updated by dividing the input signal into pure-speech, quasi-speech and non-speech frames according to multiple threshold conditions. Different values of SR and the log-likelihood ratio (LLR) indicate the amounts of attenuation and amplification distortion. The proposed method is compared with the weighted average technique (WAT) using SR, LLR, and segmental SNR.
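The three-way division of frames described above can be illustrated with a minimal sketch; the frame length and the two energy thresholds (`low`, `high`) are hypothetical placeholders, not the paper's multiple threshold conditions:

```python
import numpy as np

def classify_frames(signal, frame_len=256, low=0.01, high=0.1):
    """Label each frame as 'speech', 'quasi', or 'non-speech' by comparing
    its short-time power against two (hypothetical) thresholds, low < high."""
    n = len(signal) // frame_len
    labels = []
    for i in range(n):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)          # short-time power of the frame
        if energy >= high:
            labels.append("speech")
        elif energy >= low:
            labels.append("quasi")
        else:
            labels.append("non-speech")
    return labels
```

Noise statistics would then be updated only from the "non-speech" (and, more cautiously, "quasi") frames.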
A REVIEW OF LPC METHODS FOR ENHANCEMENT OF SPEECH SIGNALS (ijiert bestjournal)
This paper presents a review of LPC methods for the enhancement of speech signals. The purpose of every speech enhancement method is to improve the quality of the speech signal by minimizing the background noise. The paper comments in particular on Power Spectral Subtraction, Multiband Spectral Subtraction, the Non-Linear Spectral Subtraction method, the MMSE Spectral Subtraction method and Spectral Subtraction based on perceptual properties. Since spectral subtraction produces residual noise and musical noise, Linear Predictive Analysis is used to enhance the speech.
Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction (IOSRJVSP)
This paper aims to reduce the background noise introduced into a speech signal during capture, storage, transmission and processing, using a spectral subtraction algorithm. To account for the fact that colored noise corrupts the speech signal non-uniformly across frequency bands, the Multi-Band Spectral Subtraction (MBSS) approach is exploited, in which the amount of noise subtracted from the noisy speech signal is decided by a weighting factor. The choice of optimal weight values decides the performance of the speech enhancement system. In this paper the weights are decided by the Spectral Flatness Measure (SFM) rather than the conventional SNR (Signal to Noise Ratio) based rule, since SFM provides a truer distinction between the speech signal and the noise signal. Spectrograms and Mean Opinion Scores show that speech enhanced by the proposed SFM-based MBSS has better perceptual quality and improved intelligibility than the existing SNR-based MBSS.
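The spectral flatness measure used for the weighting above is the ratio of the geometric to the arithmetic mean of the power spectrum; a minimal illustration (the frame length and the `eps` floor are arbitrary choices, not the paper's settings):

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Spectral flatness: geometric mean over arithmetic mean of the power
    spectrum. Close to 1 for noise-like frames, close to 0 for tonal ones."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps   # eps avoids log(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    return geometric_mean / np.mean(power)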
Speech signal analysis for linear filter banks of different orders (eSAT Journals)
Abstract: In speech signal processing, the use of filter banks is very important. The critical requirement is that the sum of the frequency responses of the band-pass filters of the filter bank, i.e. the composite frequency response, be flat with linear phase. This paper deals with the design and implementation of a filter bank based on linear-phase FIR digital filters with a flat composite frequency response. The design exploits special properties of FIR filters by which an excellent frequency response can be achieved. Keywords: Speech signal, FIR filter, Composite frequency response.
Single Channel Speech Enhancement using Wiener Filter and Compressive Sensing (IJECEIAES)
Speech enhancement algorithms are used to overcome several limiting factors in recent applications such as mobile phones and communication channels. The central challenge for corrupted speech is the trade-off between noise reduction and signal distortion. We used a modified Wiener filter and compressive sensing (CS) to investigate and evaluate the improvement in speech quality. The new method adapts the noise estimate and the Wiener filter gain function so as to increase the weight of the amplitude spectrum and better preserve the signal of interest. CS is then applied using the gradient projection for sparse reconstruction (GPSR) technique to empirically investigate the interactive effects of the corrupting noise and to obtain better perceptual quality with less listener fatigue under noise-reduction conditions. The proposed algorithm outperforms other conventional algorithms in objective assessment tests under various noise types at SNRs of 0, 5, 10 and 15 dB. It therefore achieves a significant improvement in speech quality and more efficient noise reduction than the conventional algorithms.
Comparative performance analysis of channel normalization techniques (eSAT Journals)
Abstract: A major part of the interaction between humans takes place via speech communication. The speech signal carries both useful and unwanted information, and processing such signals involves enhancing the useful information. The intelligibility of speech signals is significantly reduced by the presence of unwanted information such as noise. Channel normalization algorithms suppress the additive noise introduced into speech signals by the transmission channel or by the recording environment. Enhancing the quality and intelligibility of speech signals improves the performance of speech systems such as automatic speech recognition (ASR), voice communication and hearing aids, to name a few. Based on the experimental results, a comparative analysis of channel normalization techniques is presented in this paper to find the most suitable algorithm for enhancing speech signals. Keywords: Cepstral Mean Normalization, Spectral Subtraction, Wiener filter, Signal to Noise Ratio
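Of the techniques listed, Cepstral Mean Normalization has a particularly compact form: a stationary convolutional channel becomes an additive constant in the cepstral domain, so subtracting the per-coefficient mean over the utterance removes it. A minimal sketch:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-coefficient mean over the utterance (frames x coeffs),
    removing a stationary channel that is additive in the cepstral domain."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0)
```

Because the channel offset is constant over frames, normalizing clean and channel-distorted features yields identical results.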
Recently, many studies have been carried out drawing inspiration from ecological phenomena to develop optimization techniques. The algorithm considered here is motivated by a common phenomenon in agriculture: the colonization of invasive weeds. In this paper, a modified invasive weed optimization (IWO) algorithm is presented for multiobjective flexible job shop scheduling problems (FJSSPs) with the criteria of minimizing the maximum completion time (makespan), the total workload of the machines and the workload of the critical machine. IWO is a bio-inspired metaheuristic that mimics the ecological behaviour of weeds in colonizing and finding suitable places for growth and reproduction. IWO was developed for continuous optimization problems, which is why the Smallest Position Value (SPV) heuristic rule is used to convert the continuous position values into discrete job sequences. The computational experiments show that the proposed algorithm is highly competitive with the state-of-the-art methods in the literature, since it is able to find the optimal and best-known solutions on the instances studied.
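The SPV rule mentioned above has a simple standard form: jobs are ordered by ascending continuous position value. A minimal sketch (the position vector is a hypothetical example, not taken from the paper):

```python
import numpy as np

def spv_decode(position):
    """Smallest Position Value rule: sort jobs by ascending continuous
    position value, turning a real-valued weed position into a permutation."""
    return [int(i) for i in np.argsort(position)]
```

For example, a weed at position `[0.7, 0.1, 0.5, 0.9]` schedules job 1 first (smallest value 0.1), then jobs 2, 0 and 3.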
In this study, a controller for a three-tank multi-loop system is designed using the Coefficient Diagram Method (CDM), one of the polynomial methods in control design. The controller designed by the CDM technique is based on the coefficients of the characteristic polynomial of the closed-loop system, chosen according to performance specifications such as the equivalent time constant, the stability indices and the stability limit. A controller is designed for the three-tank process using the CDM technique, and the simulation results show that the proposed control strategy has good set-point tracking and better response capability.
In order to cope with real-world problems more effectively, we design a decision support system for tuberculosis bacterium class identification. In this paper we propose a fuzzy diagnosability approach that takes values in the interval [0, 1] and, based on the observability of events, we formalize the construction of the diagnosers that are used to perform diagnosis. In particular, we present a framework for the fuzzy expert system, discuss the suitability of artificial intelligence as a novel soft paradigm, and review work from the literature on the development of medical diagnostic systems. The proposed approach handles diagnosability problems for both crisp and fuzzy input data. An accuracy analysis of the designed decision support system on demographic data was carried out by comparing expert knowledge with the system-generated response. This emblematic approach, using a fuzzy inference system, describes a technique to forecast the presence of the bacterium and provides a support platform to pulmonary researchers in identifying the ailment effectively.
An artificial neural network model for classification of epileptic seizures u... (ijsc)
Epilepsy is one of the most common neurological disorders, characterized by transient and unexpected electrical disturbances in the brain. In this paper the EEG signals are decomposed into a finite set of band-limited signals termed intrinsic mode functions (IMFs). The Hilbert transform is applied to these IMFs to calculate instantaneous frequencies. The 2nd, 3rd and 4th IMFs are used to extract features of the epileptic signal. A neural network using the back-propagation algorithm is implemented for the classification of epilepsy. An overall classification accuracy of 99.8% is achieved.
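The instantaneous-frequency step described above can be sketched as follows; the analytic signal is built directly in the frequency domain (a common construction, here assuming an even-length real input), and the EMD stage that would produce the IMFs is omitted:

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the frequency-domain construction: keep DC and
    Nyquist, double the positive frequencies, zero the negative ones.
    Assumes an even-length real-valued input."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1
    h[1:N // 2] = 2
    h[N // 2] = 1
    return np.fft.ifft(X * h)

def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz from the unwrapped phase derivative
    of the analytic signal (length N-1 output)."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2 * np.pi)
```

Applied to a single IMF, this yields the frequency track from which the paper's features (e.g. statistics over the 2nd to 4th IMFs) could be derived.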
Integrating vague association mining with markov model (ijsc)
The increasing demand on the World Wide Web raises the need to predict users' web page requests. The most widely used approach to predicting web pages is the pattern discovery process of Web usage mining. This process inevitably involves techniques such as Markov models, association rules and clustering, and fuzzy theory has been combined with these techniques for better results. Our focus is on Markov models: this paper introduces vague rules combined with Markov models, using vague set theory, for greater accuracy.
EXPLORING SOUND SIGNATURE FOR VEHICLE DETECTION AND CLASSIFICATION USING ANN (ijsc)
This paper explores the possibility of using sound signatures for vehicle detection and classification. Sounds emitted by vehicles were captured on a two-lane undivided road carrying moderate traffic. Simultaneous arrival of different types of vehicles, overtaking at the study location, the sound of horns, random but identifiable background noises, and continuous high-energy background noise were the challenges encountered during data collection. Different features were explored, of which smoothed log energy was found useful for automatic vehicle detection by locating peaks. Mel-frequency cepstral coefficients extracted from fixed regions around the detected peaks, along with manual vehicle labels, are used to train an Artificial Neural Network (ANN). The classifier was trained for four broad classes: heavy, medium, light and horns. The developed ANN classifier was able to predict the categories well.
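The smoothed log-energy peak picking used for vehicle detection can be sketched as below; the frame length, hop, smoothing window and threshold are hypothetical choices, not the values used in the study:

```python
import numpy as np

def smoothed_log_energy(signal, frame_len=256, hop=128, win=5):
    """Short-time log energy of overlapping frames, smoothed with a
    moving-average window of `win` frames."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    e = np.array([np.log(np.sum(signal[i:i + frame_len] ** 2) + 1e-12)
                  for i in starts])
    kernel = np.ones(win) / win
    return np.convolve(e, kernel, mode="same")

def detect_peaks(e, threshold):
    """Indices of local maxima above a threshold: candidate vehicle passes."""
    return [i for i in range(1, len(e) - 1)
            if e[i] > threshold and e[i] >= e[i - 1] and e[i] > e[i + 1]]
```

Fixed-length regions around each detected peak would then be passed to MFCC extraction and the ANN classifier.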
The increasing popularity of anime makes it vulnerable to unwanted usage such as copyright violation and pornography, so we need methods to detect and recognize animation characters, and skin detection is one of the most important steps toward this goal. Although there are methods to detect human skin color, they do not work properly for anime characters: anime skin differs greatly from human skin in color, texture, tone and under different kinds of lighting, and it also varies greatly between characters. Moreover, many non-skin regions (for example leather, shirts or hair) can have colors similar to skin. In this paper, we propose three methods that identify an anime character's skin more successfully than the Kovac, Swift, Saleh and Osman methods, which are primarily designed for human skin detection. Our methods are based on RGB values and their comparative relations.
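For context, one commonly cited form of the Kovac rule for human skin under uniform daylight (the kind of RGB-relation baseline the proposed anime-specific methods are compared against) can be sketched as follows; the thresholds are quoted from the skin-detection literature, not from this paper:

```python
def kovac_skin(r, g, b):
    """One commonly cited form of the Kovac RGB rule for human skin under
    uniform daylight illumination. The anime-specific methods in the paper
    replace these thresholds with relations tuned to drawn characters."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15   # enough color spread
            and abs(r - g) > 15 and r > g and r > b)  # red-dominant pixel
```

Rules of this shape are cheap per-pixel tests, which is why both the baselines and the proposed methods operate directly on RGB values.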
PRIVACY ENHANCEMENT OF NODE IN OPPORTUNISTIC NETWORK BY USING VIRTUAL-ID (ijsc)
An opportunistic network is one sort of wireless network. Delay-tolerant networking is a communication proposal that enables communication in situations where an end-to-end path may never exist; messages are forwarded on the basis of opportunity. The time interval to deliver a message is long, and we cannot estimate or predict the time until the message is received. There are privacy issues in these sorts of networks. In this paper we propose a new procedure that increases the privacy of nodes and improves the performance of the network.
PERFUSION SYSTEM CONTROLLER STRATEGIES DURING AN ECMO SUPPORT (ijsc)
In this work, modelling and control of a perfusion system is presented. The perfusion system simultaneously controls the blood-gas partial pressures during Extra Corporeal Membrane Oxygenation (ECMO) support. The main problem in an ECMO system is the exchange of blood gases in the artificial lung (oxygenator). This is a highly nonlinear process with time-varying parameters and varying time delays, and it is currently controlled manually by a trained perfusionist. The new control strategy implemented here uses a feedback linearization routine with time-delay compensation for the partial pressures of oxygen and carbon dioxide. The controllers were tuned robustly and tested in simulations with a detailed artificial lung (oxygenator) model under cardiopulmonary bypass conditions. This automatic control strategy is proposed to improve patient safety through fast control reference tracking and good disturbance rejection under varying conditions.
METAHEURISTIC OPTIMIZATION ALGORITHM FOR THE SYNCHRONIZATION OF CHAOTIC MOBIL... (ijsc)
We provide a scheme for the synchronization of two chaotic mobile robots when a mismatch between the
parameter values of the systems to be synchronized is present. We have shown how meta-heuristic
optimization can be used to adapt the parameters in two coupled systems such that the two systems are
synchronized, although their behavior is chaotic and they have started with different initial conditions and
parameter settings. The controlled system synchronizes its dynamics with the control signal in the periodic
as well as chaotic regimes. The method can be seen also as another way of controlling the chaotic
behavior of a coupled system. In the case of coupled chaotic systems, under the interaction between them,
their chaotic dynamics can be cooperatively self-organized. A synergistic approach to the meta-heuristic optimization search algorithm is developed. To avoid being trapped in local optima and to enrich the searching behavior, chaotic dynamics is incorporated into the proposed search algorithm: first, a chaotic Lévy flight is used to generate new solutions efficiently, and second, a chaotic sequence and a psychological emotion factor are introduced for move acceptance. We illustrate the application of the algorithm by estimating the complete parameter
vector of a chaotic mobile robot.
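The Lévy flight used for generating new solutions is typically realized with Mantegna's algorithm for heavy-tailed steps; a minimal sketch (the stability index `beta` is a typical choice, not necessarily the paper's, and the chaotic-map variant is omitted):

```python
import math
import numpy as np

def levy_step(beta=1.5, size=1, rng=None):
    """Heavy-tailed Levy-flight steps via Mantegna's algorithm:
    step = u / |v|^(1/beta) with u ~ N(0, sigma^2), v ~ N(0, 1)."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta
                * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)
```

Most steps are small (local search) while occasional very large steps help the search escape local optima.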
HYBRID PARTICLE SWARM OPTIMIZATION FOR SOLVING MULTI-AREA ECONOMIC DISPATCH P... (ijsc)
We consider the Multi-Area Economic Dispatch Problem (MAEDP) in a deregulated power system environment for practical multi-area cases with tie-line constraints. Our objective is to allocate generation among the power generators in such a manner that the total fuel cost is minimized while all operating constraints are satisfied. This problem is NP-hard. In this paper, we propose a hybrid Particle Swarm Optimization (HGAPSO) to solve the MAEDP. Experimental results are reported to show the efficiency of the proposed algorithm compared with Particle Swarm Optimization with Time-Varying Acceleration Coefficients (PSO-TVAC) and RCGA.
Methodological study of opinion mining and sentiment analysis techniques (ijsc)
Decision making, at both the individual and the organizational level, is always accompanied by a search for others' opinions on the matter. The tremendous growth of opinion-rich resources such as reviews, forum discussions, blogs, micro-blogs and Twitter provides a rich anthology of sentiments. This user-generated content can benefit the market if its semantic orientations are deliberated. Opinion mining and sentiment analysis are the formalizations for studying and construing opinions and sentiments. The digital ecosystem has itself paved the way for exploiting the huge volume of opinionated data being recorded. This paper attempts to review and evaluate the various techniques used for opinion and sentiment analysis.
A new design reuse approach for VoIP implementation into FPSoCs and ASICs (ijsc)
The aim of this paper is to present a new design reuse approach for the automatic generation of a Voice over Internet Protocol (VoIP) hardware description and its implementation into FPSoCs and ASICs. Our motivation for this work is justified by the following arguments. First, VoIP-based System on Chip (SoC) implementation is an emerging research and development area in which innovative applications can be implemented. Second, these systems are very complex, and due to time-to-market pressure there is a need to build platforms that help the designer explore different architectural possibilities and choose the circuit that best corresponds to the specifications. Third, we aim to develop for hardware the design methods and tools that are used in software, such as the MATLAB tool, for VoIP implementation. To achieve our goal, the proposed design approach is based on a modular design of the VoIP architecture. The originality of our approach is the application of the design-for-reuse (DFR) and design-with-reuse (DWR) concepts. To validate the approach, a case study of a SoC based on the OR1K processor is presented. We demonstrate that the proposed SoC architecture is reconfigurable and scalable, and that the final RTL code can be reused for any FPSoC or ASIC technology. As an example, performance measures in the VIRTEX-5 FPGA device family and in 65 nm ASIC technology are shown in this paper.
Solving the associated weakness of biogeography based optimization algorithm (ijsc)
Biogeography-based optimization (BBO) is a population-based evolutionary algorithm based on an old theory of island biogeography that explains the geographical distribution of biological organisms. BBO was introduced in 2008, and many modifications and hybridizations have since been employed to enhance its performance. Researchers found that the original version of BBO has some weakness in its exploration. This paper tries to solve the root problems themselves instead of treating their effects with different techniques. It proposes two modifications: first, modifying the probabilistic selection process of the migration and mutation stages to give a fairly randomized selection over all the features of the islands; second, adjusting the clear-duplication process, which is located after the mutation stage, to avoid any corruption of the suitability index variables of the non-mutated islands. The proposed modifications are extensively tested on 120 test functions with different dimensions and complexities. The results show that BBO performance can be enhanced effectively without embedding any additional sub-algorithm and without using any complicated form of the immigration and emigration rates. In addition, the new BBO algorithm requires less CPU time and is even faster than the original simplified partial-migration-based BBO. These essential modifications should be considered an initial step for further modifications.
Soft computing is likely to play a progressively important role in many applications, including image enhancement. The paradigm for soft computing is the human mind, and the soft computing critique has been particularly strong with fuzzy logic, which represents facts as rules for the management of uncertainty. In this paper, the multi-dimensional optimization problem is addressed by discussing optimal thresholding using fuzzy entropy for image enhancement. The technique is compared with bi-level and multi-level thresholding, and optimal thresholding values are obtained for different levels of speckle-noisy and low-contrast images. The fuzzy entropy method produced better results than the bi-level and multi-level thresholding techniques.
Reliable gait features are required to extract gait sequences from images. This paper suggests a simple moment-based method for gait identification. Moment values are extracted from different numbers of frames of the gray-scale and silhouette images of the CASIA database and are used as feature values. Fuzzy logic and a nearest-neighbour classifier are used for classification, and both achieved high recognition rates.
Classical methods for the classification of pixels in multispectral images include supervised classifiers such as the maximum-likelihood classifier, neural network classifiers, fuzzy neural networks, support vector machines, and decision trees. Recently, there has been increased interest in ensemble learning, a method that generates many classifiers and aggregates their results. Breiman proposed Random Forest in 2001 for classification and clustering. Random Forest grows many decision trees for classification. To classify a new object, the input vector is run through each decision tree in the forest; each tree gives a classification, and the forest chooses the classification having the most votes. Random Forest provides a robust algorithm for classifying large datasets, but its potential has not yet been explored for analyzing multispectral satellite images. To evaluate the performance of Random Forest, we classified multispectral images using various classifiers, namely the maximum-likelihood classifier, a neural network, a support vector machine (SVM) and Random Forest, and compared their results.
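The voting step described above (each tree classifies, the forest takes the majority) can be sketched as below; the per-tree rules are hypothetical stand-ins for trained decision trees over multispectral band values:

```python
from collections import Counter

def forest_predict(trees, x):
    """Random Forest prediction as described: run the input vector through
    every tree and return the class with the most votes."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical per-tree decision rules on a 4-band pixel (stand-ins for
# trained decision trees; band 3 is imagined as near-infrared):
trees = [
    lambda px: "water" if px[3] < 0.1 else "vegetation",
    lambda px: "water" if px[0] > px[1] else "vegetation",
    lambda px: "vegetation" if px[3] > 0.3 else "water",
]
```

In a real system each element of `trees` would be a decision tree trained on a bootstrap sample with random feature subsets, which is what makes the ensemble robust.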
Fault diagnosis of a high voltage transmission line using waveform matching a... (ijsc)
This paper addresses the problem of accurate fault diagnosis by incorporating a waveform matching technique. Fault isolation and detection on a double-circuit high-voltage power transmission line is of immense importance from the point of view of energy management services. Power system fault types, namely single line-to-ground faults, line-to-line faults, double line-to-ground faults, etc., are responsible for transients in the current and voltage waveforms. Waveform matching deals with the approximate superimposition of discretized versions of such waveforms, obtained from recording devices and from software respectively. The analogy derived from these waveforms is expressed as an error function of voltage and current from the considered metering devices. This allows fault identification to be modelled as an optimization problem of minimizing the error between the two sets of waveforms; in other words, it exploits the discrepancies between them. The analysis has been carried out using the Bare Bones Particle Swarm Optimizer on IEEE 2-bus, 6-bus and 14-bus systems, and the performance of the algorithm has been compared with an analogous meta-heuristic, the BAT optimization algorithm, at the 2-bus level. The primary focus of this paper is to demonstrate the efficiency of such methods, state the common peculiarities in the measurements, and suggest possible remedies for such distortions.
Broad Phoneme Classification Using Signal Based Features (ijsc)
Speech is the most efficient and popular means of human communication. Speech is produced as a sequence of phonemes, and phoneme recognition is the first step performed by an automatic speech recognition system. The state-of-the-art recognizers use mel-frequency cepstral coefficient (MFCC) features derived through short-time analysis, for which the recognition accuracy is limited. Instead, broad phoneme classification is achieved here using features derived directly from the speech at the signal level. The broad phoneme classes are vowels, nasals, fricatives, stops, approximants and silence. The features identified as useful for broad phoneme classification are the voiced/unvoiced decision, zero crossing rate (ZCR), short-time energy, most dominant frequency, energy in the most dominant frequency, spectral flatness measure and the first three formants. Features derived from short-time frames of training speech are used to train a multilayer feedforward neural network classifier with manually marked class labels as output, and the classification accuracy is then tested. The broad phoneme classifier is then used for broad syllable structure prediction, which is useful for applications such as automatic speech recognition and automatic language identification.
An Effective Approach for Chinese Speech Recognition on Small Size of Vocabulary (sipij)
In this paper, an effective approach for Chinese speech recognition on a small vocabulary is proposed: the independent speech recognition of Chinese words based on a Hidden Markov Model (HMM). The features of the speech words are generated from the sub-syllables of the Chinese characters. A total of 640 speech samples were recorded by 4 native males and 4 females. The preliminary results of inside and outside testing achieve 89.6% and 77.5%, respectively. To improve the performance, a keyword-spotting criterion is applied to our system, after which the average precision rates for inside and outside testing reach 92.7% and 83.8%. The results prove that the approach is effective for Chinese speech recognition on a small vocabulary.
Audio/Speech Signal Analysis for Depression (ijsrd.com)
The word "depressed" is a common everyday word. People might say "I am depressed" when in fact they mean "I am fed up because I have had a row, or failed an exam, or lost my job", etc. These ups and downs of life are common and normal, and most people recover quite quickly. Depression is identified by different methods; here we identify depression using the MFCC (mel-frequency cepstral coefficient) method. Different parameters are used to discriminate depressed speech from normal speech, but MFCC-based parameters carry the most applicable information, because depressive speech contains more information in the higher energy bands when compared with normal speech.
Speech Enhancement for Nonstationary Noise Environments (sipij)
In this paper, we present a simultaneous detection and estimation approach for speech enhancement in nonstationary noise environments. A detector for speech presence in the short-time Fourier transform domain is combined with an estimator, which jointly minimizes a cost function that takes into account both detection and estimation errors. Under speech presence, the cost is proportional to a quadratic spectral amplitude error, while under speech absence, the distortion depends on a certain attenuation factor. Experimental results demonstrate the advantage of the proposed simultaneous detection and estimation approach, which facilitates the suppression of nonstationary noise with a controlled level of speech distortion.
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemes (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy in some other task such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is mostly used
for aligning two given multidimensional sequences. It finds an optimal match between the given sequences.
The distance between the aligned sequences should be smaller than that between the unaligned sequences, and the improvement in the alignment may be estimated from the corresponding distances. This
technique has applications in speech recognition, speech synthesis, and speaker transformation. The
objective of this research is to investigate the amount of improvement in the alignment corresponding to the
sentence based and phoneme based manually aligned phrases. The speech signals in the form of twenty five
phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was
segmented manually and aligned at sentence and phoneme level. The aligned sentences of different speaker
pairs were analyzed using HNM and the HNM parameters were further aligned at frame level using DTW.
Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than a 20% reduction in the average Mahalanobis distance.
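The frame-level DTW alignment described above can be sketched with the classic dynamic program; this minimal version uses a Euclidean local cost between frames rather than the Mahalanobis distance used in the study:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW: cumulative cost of the optimal
    monotonic alignment between two sequences of feature vectors.
    Local cost here is the Euclidean distance between frames."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]          # treat scalars as 1-D feature vectors
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warp may repeat frames, a sequence and its time-stretched copy align with zero cost, which is exactly the property exploited when aligning HNM parameter frames of two speakers.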
Effect of MFCC Based Features for Speech Signal Alignments (kevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition and speech transformation. Feature extraction techniques provide a compressed representation of the speech signal, and HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences: it locates an optimal match between them, and the improvement in the alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word- and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded; the recorded material was segmented manually and aligned at the sentence, word and phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS (ijnlc)
Speech Analysis and Synthesis Using Vocoder
Abstract— This paper proposes speech analysis and synthesis using a vocoder. Voice conversion systems do not create new speech signals; they transform existing ones. The proposed speech vocoding differs from conventional speech coding: the speech signal is analysed and represented with fewer bits so that bandwidth efficiency is increased, and the speech signal is then synthesised from the received bits of information. Three aspects of analysis are discussed: pitch refinement, spectral envelope estimation, and maximum voiced frequency estimation. A quasi-harmonic analysis model can be used to implement a pitch refinement algorithm that improves the accuracy of the spectral estimation, and a harmonic plus noise model reconstructs the speech signal from the parameters. The goal is to achieve the highest possible resynthesis quality using the lowest possible number of bits to transmit the speech signal. Future work aims at incorporating phase information into the analysis and modelling process and at synthesising these three aspects over different pitch periods.
Speaker Recognition System using MFCC and Vector Quantization Approach
This paper presents an approach to speaker recognition using frequency spectral information with Mel frequency to improve speech feature representation in a Vector Quantization (VQ) codebook based recognition approach. The Mel frequency approach extracts the features of the speech signal to obtain the training and testing vectors. The VQ codebook approach uses the training vectors to form clusters and recognizes speakers accurately with the help of the LBG algorithm.
Automatic Speech Emotion and Speaker Recognition Based on Hybrid GMM and FFBNN
In this paper we present text-dependent speaker recognition, enhanced by first detecting the emotion of
the speaker using hybrid FFBNN and GMM methods. The emotional state of the speaker influences the
recognition system. The Mel-Frequency Cepstral Coefficient (MFCC) feature set is used for
experimentation. To recognize the emotional state of a speaker, a Gaussian Mixture Model (GMM) is used
in the training phase and a Feed Forward Back Propagation Neural Network (FFBNN) in the testing phase.
A speech database of 25 speakers recorded in five different emotional states (happy, angry, sad,
surprise, and neutral) is used for experimentation. The results reveal that the emotional state of the
speaker has a significant impact on the accuracy of speaker recognition.
Effect of Time Derivatives of MFCC Features on HMM Based Speech Recognition S...
In this paper, the improvement of an ASR system for the Hindi language, based on vector-quantized MFCC
as feature vectors and an HMM as classifier, is discussed. MFCC features are usually pre-processed
before being used for recognition; one such pre-processing step is to create delta and delta-delta
coefficients and append them to the MFCC to form the feature vector. This paper focuses on all digits
in Hindi (zero to nine), using an isolated-word structure. Performance of the system is evaluated by
the Recognition Rate (RR). The combination of the Delta MFCC (DMFCC) feature with the Delta-Delta MFCC
(DDMFCC) feature shows approximately 2.5% further improvement in the RR, with no additional
computational cost involved. The RR for speakers involved in the training phase is found to be better
than that for speakers who were not involved in the training phase. Word-wise RR is observed to be good
for digits with distinct phones.
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...
Previous research has identified the autocorrelation domain as an appropriate domain for signal and
noise separation. This paper discusses a simple and effective method for decreasing the effect of noise
on the autocorrelation of the clean signal, which can later be used in extracting mel-cepstral
parameters for speech recognition. Two different methods are proposed to deal with the error introduced
by treating speech and noise as completely uncorrelated. The basic approach reduces the effect of noise
by estimating its contribution and subtracting it from the noisy speech signal autocorrelation. To
improve this method, we consider inserting a speech/noise cross-correlation term into the equations
used for estimating the clean speech autocorrelation, using an estimate of it found through a kernel
method. Alternatively, we use an estimate of the cross-correlation term obtained through an averaging
approach. A further improvement was obtained by introducing an overestimation parameter into the basic
method. We tested the proposed methods on the Aurora 2 task. The basic method has shown considerable
improvement over the standard features and some other robust autocorrelation-based features, and the
proposed techniques further increase its robustness.
GENDER RECOGNITION SYSTEM USING SPEECH SIGNAL
In this paper, a system developed for speech encoding, analysis, synthesis, and gender identification
is presented. A typical gender recognition system can be divided into a front-end system and a back-end
system. The task of the front-end system is to extract gender-related information from a speech signal
and represent it by a set of vectors called features. Features like power spectral density and
frequency at maximum power carry speaker information. The features are extracted using the Fast Fourier
Transform (FFT) algorithm. The task of the back-end system (also called the classifier) is to create a
gender model to recognize the gender from a speech signal in the recognition phase. This paper also
presents the digital processing of speech signals (pronounced “A” and “B”) taken from 10 persons, 5
male and 5 female. The power spectrum estimate of the signal is examined, and the frequency at maximum
power of the English phonemes is extracted from it. The system uses a threshold technique as the
identification tool. The recognition accuracy of this system is 80% on average.
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
In this paper, a voice activity detector (VAD) is proposed on the basis of Gaussian modeling of noise in the spectro-temporal space, which is obtained from auditory cortical processing. The auditory model, which offers a multi-dimensional picture of the sound, includes two stages: the first is a model of the inner ear, and the second models the auditory cortex in the brain. In this picture, the noise has been modeled by a 3-D mono-Gaussian cluster. At the start of the suggested VAD process, the noise is modeled by a Gaussian-shaped cluster, and the average noise behavior is obtained at various points in the spectro-temporal space for each frame. In the stage of separating speech from noise, the criterion is the difference between the average noise behavior and the speech signal amplitude in the spectro-temporal domain, measured for each frame and used as the classification criterion. Using NOISEX-92, this method is tested under different noise types such as white, exhibition, street, office, and train noise. The results are compared to both an auditory-model method and a multi-feature method; the performance of this method in low signal-to-noise ratio (SNR) conditions is observed to be better than other current methods.
Broad phoneme classification using signal based features
International Journal on Soft Computing (IJSC) Vol. 5, No. 3, November 2014
BROAD PHONEME CLASSIFICATION USING SIGNAL
BASED FEATURES
Deekshitha G1 and Leena Mary2
1,2Advanced Digital Signal Processing Research Laboratory, Department of Electronics
and Communication, Rajiv Gandhi Institute of Technology, Kottayam, Kerala, India
ABSTRACT
Speech is the most efficient and popular means of human communication. Speech is produced as a sequence
of phonemes, and phoneme recognition is the first step performed by an automatic speech recognition
system. State-of-the-art recognizers use mel-frequency cepstral coefficient (MFCC) features derived
through short-time analysis, for which the recognition accuracy is limited. Instead, broad phoneme
classification is achieved here using features derived directly from the speech at the signal level
itself. Broad phoneme classes include vowels, nasals, fricatives, stops, approximants, and silence. The
features identified as useful for broad phoneme classification are the voiced/unvoiced decision, zero
crossing rate (ZCR), short-time energy, most dominant frequency, energy at the most dominant frequency,
spectral flatness measure, and the first three formants. Features derived from short-time frames of
training speech are used to train a multilayer feedforward neural network classifier with manually
marked class labels as output, and classification accuracy is then tested. This broad phoneme
classifier is later used for broad syllable structure prediction, which is useful for applications such
as automatic speech recognition and automatic language identification.
KEYWORDS
Automatic Speech Recognition, Broad phoneme classes, Neural Network Classifier, Phoneme, Syllable, Signal level features
1. INTRODUCTION
In terms of human communication, speech is the most important and efficient mode of communication,
even in today's multimedia society. We therefore want automatic speech recognition systems to be
capable of recognizing fluent conversational speech from any random speaker. Speech recognition (SR) is
the translation of spoken words into text. In an ASR (Automatic Speech Recognition) system, the first
step is feature extraction, where the sampled speech signal is parameterized. The goal is to extract a
set of parameters/features from the signal that carries the maximum information relevant for
classification. The most popular spectrum-based parameters used in state-of-the-art ASR are the Mel
Frequency Cepstral Coefficients (MFCC) [1][2]. The main demerits of MFCC are its complex calculation
and limited recognition rate. The poor phone recognition accuracy of conventional ASR is later
compensated by language models at the sub-word and word levels to obtain a reasonable word error rate
[3].
Speech is composed of basic sound units known as phonemes [1][2]. The waveform
representation of each phoneme is characterized by a small set of distinctive features, where a
distinctive feature is a minimal unit which distinguishes between two maximally close but
linguistically distinct speech sounds. These acoustic features should not be affected by different
DOI: 10.5121/ijsc.2014.5301
vocal tract sizes and shapes of speakers or by changes in voice quality. In this work, we have
attempted broad phoneme classification using signal-level features. Broad phoneme classes include
vowels, nasals, plosives, fricatives, approximants, and silence [1][2]. Each of these classes has
discriminant features by which it can be easily classified: for example, vowels can be categorized by
their higher amplitude, and fricatives by their high zero crossing rate. To identify such
characteristics, an analysis study was conducted and the results were summarized. From this study, we
have come up with a set of feature vectors capable of performing broad phoneme classification.
The paper is organized as follows: Section 2 gives the system overview by explaining the database, the
extraction of relevant features, and the classifier. In Section 3, details of the experiments are
described along with the evaluation results. Finally, in Section 4, the paper is wrapped up with a
summary and the scope for future work.
2. SYSTEM OVERVIEW
Figure 1. Broad phoneme classification
A classifier system for acoustic-phonetic analysis of continuous speech is being developed to serve as
part of an automatic speech recognition system. The system accepts the speech waveform as input and
produces a string of broad phoneme-like units as output. As shown in Figure 1, the input speech is
parameterized into a normalized feature set [4][5], and these features are then applied to the ANN [6]
classifier to obtain the broad phoneme class labels.
Table 1. Broad phoneme classes considered in the work.
The broad phoneme classes in the Malayalam language considered in this work, their symbols, and the
corresponding International Phonetic Alphabet (IPA) [1] symbols are listed in Table 1. The broad
classification yields six categories: vowels, nasals [7], fricatives, plosives, approximants [5], and
silence. To classify the input sound units into one of these six classes, specific features are needed
to distinguish them, so a study was first conducted by analyzing pitch contours, energy contours,
formant plots, frequency spectra, etc.
2.1. Data Collection and Transcription
First phase of the work was the data collection and its manual transcription which was much time
consuming. For speech data, Malayalam read speech was downloaded from All India Radio
(AIR), and speech was also recorded in the lab environment. Figure 2 illustrates the manual
labeling of speech signal using wavesurfer.
2.2. Feature Extraction
A preliminary analysis study was conducted to identify discriminative features helpful for broad
phoneme classification. Various features [4][5][6] explored include the following: voicing
information, short time energy, Zero Crossing Rate (ZCR), Most Dominant Frequency (MDF),
magnitude at the MDF, spectral flatness measure, formant frequencies, difference and ratios of
formant frequencies, bandwidth of formant frequencies etc [1][2]. Among them we selected nine
features for designing the broad phoneme classifier. It includes (1) voicing decision, (2) ZCR, (3)
short time energy, (4) most dominant frequency, (5) magnitude at MDF, (6) spectral flatness
measure, and (7) first three formants. Brief descriptions about these selected features are
explained below.
2.2.1. Voiced/Unvoiced Decision
Voicing information helps to determine whether the frame is voiced or unvoiced. Here we
extracted pitch information using auto correlation function. Autocorrelation sequence is
symmetric with respect to zero lag. The pitch period information is more pronounced in the
autocorrelation sequence compared to speech and pitch period can be computed easily by a simple
peak picking algorithm [6].
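The autocorrelation-plus-peak-picking idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 50-400 Hz pitch search range and the 0.3 peak-strength threshold are our assumptions.

```python
import numpy as np

def voiced_decision(frame, fs, fmin=50.0, fmax=400.0, threshold=0.3):
    """Voiced/unvoiced decision via autocorrelation peak picking.

    A frame is declared voiced when the strongest autocorrelation peak
    inside the plausible pitch-lag range exceeds `threshold` times the
    zero-lag energy. Returns (is_voiced, pitch_hz).
    """
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")
    ac = ac[len(ac) // 2:]            # keep non-negative lags (sequence is symmetric)
    if ac[0] <= 0:
        return False, 0.0             # silent frame: no energy at all
    lag_min = int(fs / fmax)          # smallest lag = highest admissible pitch
    lag_max = min(int(fs / fmin), len(ac) - 1)
    search = ac[lag_min:lag_max]
    if len(search) == 0:
        return False, 0.0
    peak_lag = lag_min + int(np.argmax(search))
    strength = ac[peak_lag] / ac[0]   # normalized peak height
    if strength > threshold:
        return True, fs / peak_lag
    return False, 0.0
```

For a 20 ms frame of a 100 Hz tone at 8 kHz sampling, the strongest peak falls near lag 80, giving an estimated pitch of about 100 Hz.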
2.2.2. Zero Crossing Rate (ZCR)
The high and low values of ZCR correspond to unvoiced and voiced speech respectively as
shown in Figure 3. In a speech signal S (n), a zero-crossing occurs when the waveform crosses
the time axis or changes algebraic sign [1].
ZCR(n) = 0.5 * Σ_{m=-∞}^{∞} |sgn(S(m)) - sgn(S(m-1))| · w(n - m)

where sgn(S(n)) = 1 if S(n) ≥ 0, and sgn(S(n)) = -1 otherwise.
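For a rectangular window over a single frame, the definition above reduces to counting sign changes between consecutive samples; a minimal sketch:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Zero crossing rate of one frame: the mean of 0.5*|sgn(S(m)) - sgn(S(m-1))|
    over consecutive sample pairs, i.e. the fraction of pairs whose algebraic
    signs differ (rectangular window)."""
    signs = np.where(frame >= 0, 1, -1)       # sgn(S(n)) as defined above
    return 0.5 * np.mean(np.abs(np.diff(signs)))
```

An alternating-sign frame yields the maximum rate of 1.0, while a frame that never changes sign yields 0.0.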
Figure 2: (a) Manual labelling to broad phonemes, (b) pitch contour and (c) corresponding speech
waveform.
Figure 3. (a) Speech waveform and (b) ZCR.
2.2.3. Short Time Energy (STE)
Energy can be used to segment speech into smaller phonetic units in ASR systems. Short Time
Energy is a squaring or absolute magnitude operation. Energy emphasizes high amplitudes and is
simpler to calculate. The large variation in amplitudes of voiced and unvoiced speech, as well as
smaller variations between phonemes with different manners of articulation, permit
segmentations based on energy [1][2].
E(n) = Σ_{m=-∞}^{∞} [S(m) · w(n - m)]²
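A direct implementation of this definition for a single frame; the choice of window is left open (defaulting to rectangular, which is our assumption):

```python
import numpy as np

def short_time_energy(frame, window=None):
    """Short-time energy of one frame: the sum of squared, windowed samples,
    per the definition above (rectangular window if none is given)."""
    if window is None:
        window = np.ones(len(frame))
    return float(np.sum((frame * window) ** 2))
```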
2.2.4. Most Dominant Frequency (MDF)
The most dominant frequency is the frequency of sinusoidal component with highest amplitude.
If the signal is highly periodic and more or less sinusoidal, the dominant frequency will in most
cases be related to the rate of the signal. The more a signal looks like a sine wave; the better
dominant frequency analysis will be able to reflect the periodicity of the signal.
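The MDF of a frame can be estimated as the location of the largest magnitude peak in its spectrum; a minimal sketch (the Hamming window and the exclusion of the DC bin are our choices, not stated in the paper):

```python
import numpy as np

def most_dominant_frequency(frame, fs):
    """Most dominant frequency: the FFT bin (excluding DC) with the largest
    magnitude, returned as (frequency_hz, magnitude)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    k = 1 + int(np.argmax(spectrum[1:]))          # skip the DC bin
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs[k], float(spectrum[k])
```

For a 1 kHz tone sampled at 8 kHz, the returned frequency lands on the bin nearest 1000 Hz.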
2.2.5. Spectral Flatness Measure (SF)
Spectral flatness is a measure used to characterize an audio spectrum. Spectral flatness is typically
measured in decibels, and provides a way to quantify how tone-like a sound is, as opposed to
being noise-like. A high spectral flatness indicates that the spectrum has a similar amount of
power in all spectral bands i.e., similar to white noise. A low spectral flatness indicates that the
spectral power is concentrated in a relatively small number of bands i.e., a mixture of sine waves.
The spectral flatness is calculated by dividing the geometric mean of the power spectrum by the
arithmetic mean of the power spectrum.
Spectral Flatness = GM / AM
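The GM/AM ratio can be computed directly from a frame's power spectrum. In this sketch the small `eps` guarding the logarithm against zero-power bins is our addition:

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Spectral flatness = geometric mean / arithmetic mean of the power
    spectrum; near 1 for noise-like frames, near 0 for tonal frames."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps   # eps avoids log(0)
    gm = np.exp(np.mean(np.log(power)))             # geometric mean
    am = np.mean(power)
    return float(gm / am)
```

A unit impulse (perfectly flat spectrum) gives a value near 1, while a pure tone gives a value near 0.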
2.2.6. Formant Frequencies
Formants are the spectral peaks of the sound spectrum of the voice. It is often measured as
amplitude peaks in the frequency spectrum of a sound. The behaviour of the first 3 formants is of
crucial importance. Figure 4 justifies the selection of formant frequency [7] as a feature since it
classifies vowels and nasals.
Figure 4. Vowel-nasal classification using the first two formants
An optimum set of features has to be found which describes the classes with a few, but very powerful,
parameters, thereby showing optimum class discrimination ability [6]. Thus, the feature extraction
process can also be seen as a data reduction process. To obtain the features from the input speech
signal, the signal is first normalized and then segmented into frames of size 20 ms with a 10 ms
overlap. The features are then calculated for each frame. The choice of features was made after
conducting a study, which is summarized in Table 2. Figure 5 shows the plot of the selected features
for a particular word.
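The framing step described above (20 ms frames with a 10 ms shift) can be sketched as:

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=20, shift_ms=10):
    """Split a normalized signal into overlapping analysis frames:
    20 ms frames advanced by 10 ms, as used in the paper."""
    flen = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n = 1 + max(0, (len(signal) - flen) // shift)
    return np.stack([signal[i * shift: i * shift + flen] for i in range(n)])
```

At 8 kHz, a 100 ms signal yields 9 frames of 160 samples each.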
Table 2. Broad phoneme classes and their features (V: voiced, UV: unvoiced)

Class         Voicing  ZCR     STE     Duration  Burst  Strength of formants
Vowels        V        Low     High    Long      No     Strong F1
Nasals        V        Low     Medium  Medium    No     Weak F2
Plosives      V/UV     Medium  Low     Short     Yes    -
Fricatives    V/UV     High    Low     Long      Yes    -
Approximants  V        Low     High    Short     No     High F3
Silence       UV       Low     Low     -         No     -
Figure 5. (a) Waveform, (b) STE, (c) ZCR, (d) voicing information, (e) MDF, (f) SF, (g) magnitude at
MDF, (h) F1, and (i) F2 for the word //aakarshanam// in Malayalam.
2.3. Classifier Design
As explained, the feature vectors for each short frame (width 20 ms) are calculated and
normalized. Then this features are applied to a classifier for broad phoneme classification. Here
an multilayer feedforward artificial neural network [6][8][9] is used as a classifier. Neural
networks are also similar to biological neural networks in performing functions collectively and
in parallel by the units.
Figure 6. Neural network classifier for broad phoneme classification
In this work a feed-forward back propagation [9] network is created with five layers [8]: one input
layer, one output layer, and three hidden layers. The final classifier has the structure 27L 54N 20N
10N 6L, where L represents linear neurons and N represents non-linear neurons. The non-linear neurons
use the log-sigmoid activation function. Feedforward networks have one-way connections from the input
to the output layer. The network training function used is 'traincgb', which updates weight and bias
values according to conjugate gradient back propagation with Powell-Beale restarts.
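The 27L 54N 20N 10N 6L structure can be illustrated with a forward-pass sketch. This shows only the topology and the log-sigmoid activations, with random untrained weights; the paper trains the network with MATLAB's 'traincgb' (conjugate gradient with Powell-Beale restarts), which is not reproduced here:

```python
import numpy as np

def log_sigmoid(x):
    """Log-sigmoid activation used by the non-linear (N) layers."""
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes from the 27L 54N 20N 10N 6L structure:
# 27 linear inputs, three log-sigmoid hidden layers, 6 linear outputs.
sizes = [27, 54, 20, 10, 6]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass: log-sigmoid hidden layers, linear output layer.
    The index of the largest output gives the broad class label."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = log_sigmoid(x @ w + b)
    return x @ weights[-1] + biases[-1]
```

One score per broad class (vowel, nasal, fricative, plosive, approximant, silence) is produced for each 27-dimensional frame vector.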
Figure 7. Feedforward neural network
3. EXPERIMENTAL RESULTS
3.1. Classifier Training
Training the neural network is a time-consuming step, which ends either with a validation
stop or on reaching the maximum number of epochs. About 8 minutes of read speech is used to
train the network. The speech data is divided into frames of size 20 ms with 10 ms overlap. For
each of these frames 9 features are calculated, and the differences with the previous frame and
the succeeding frame are also appended. Altogether 27 (9 × 3) inputs are applied to the network
for each frame, along with its manual transcription (supervised learning).
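The assembly of the 27-dimensional input can be sketched as follows. This is a minimal illustration; padding the edge frames by repetition is an assumption, as the paper does not state how boundary frames are handled.

```python
import numpy as np

def context_features(frames):
    """Build 27-dim inputs: the 9 per-frame features plus their
    differences with the previous and the succeeding frame.

    frames: (num_frames, 9) array of per-frame features
    (voicing, ZCR, STE, MDF, energy at MDF, SFM, F1-F3).
    Edge frames are padded by repetition (an assumption here).
    """
    padded = np.vstack([frames[:1], frames, frames[-1:]])
    prev_diff = frames - padded[:-2]   # difference with previous frame
    next_diff = frames - padded[2:]    # difference with succeeding frame
    return np.hstack([frames, prev_diff, next_diff])  # (num_frames, 27)

feats = np.random.rand(100, 9)  # stand-in for per-frame features
X = context_features(feats)     # one 27-dim input row per frame
```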
3.2. Evaluation
For testing, read speech as well as some isolated words are collected. As during training,
the speech data is segmented into frames of 20 ms with 10 ms overlap. The feature vector of each
frame is applied to the network classifier, and the network outputs are interpreted as the label
that best suits the given feature vector.
3.3. Results and Discussion
The output labels obtained while testing the classifier using selected words are given below. One
broad phoneme symbol is assigned for each short frame (20 ms frame size and 10 ms frame shift)
of speech.
a. bhakshanam
MANUAL TRANSCRIPTION:
SSSSSSSSSSSPPPPPVVVVVVVPPPPPPPPPFFFFFFFFFFFVVVVVVVVNNNNNNVVVVV
VVVNNNNNNNNNNNNSSSSSSSSSSSSSS
PREDICTED PHONEME LABELS USING THE PROPOSED METHOD:
SSSSSSSSSSSSPPPPVVVVVVVPPPPPFFFFFFFFFFFFFPPVVVVVVVVVVVVVNVVVVV
VVVNNNNNNNNNNNNPPPPSSSSFSFSFS
b. haritham
MANUAL TRANSCRIPTION:
FFFFFFFFFFFFFFVVVVVVVVVVAAAAVVVVVVVVPPPPPPPPPPVVVVVVNNNNNNN
NNNNNNSSSSSSSSSS
PREDICTED PHONEME LABELS USING THE PROPOSED METHOD:
SSFFFFFFFFFPVVVVVVVVVVVVPPPAAAAAVVVVVVVPPPPPPVVVVVVVVNNNVVV
NNNNPPNNNNNFFFFF
c. hasyam
MANUAL TRANSCRIPTION:
SSSSSSSSSSFFFFFFFFFFFVVVVVVVVVVVVVVVFFFFFFFFFFFFFFFFFFFAAAAAAVV
VVVNNNNNNNNNNNNNSSSSSSSSSSS
PREDICTED PHONEME LABELS USING THE PROPOSED METHOD:
SSSSSSSSSSSSPFFFFPPPVVVVVVVVVVVVVVVFFFFFFFFFFFFFFFFFPPPAAAAAAVV
VVVVVVVVVVVVNNNNNPPPSSSSSSS
d. manasantharam
MANUAL TRANSCRIPTION:
SSSSNNNNNVVVVVVVVVVVVVNNNNNNNNVVVVVVVVFFFFFFFFFVVVVVVVVVV
VNNNNNNNNPPPPVVVVAAAAVVVVVNNNNNNNNNNNNNNNNNN
PREDICTED PHONEME LABELS USING THE PROPOSED METHOD:
SNNNNNNNAAAAAVVVVVAAVVNNNNNNAAAAAVVVPFFFFFFPPPAAAAAAAAVV
VVNNNNNVVVPAAAAVVVVNVAVVVVVNNNNNNNNNNNNNNNNNV
The confusion matrix obtained while testing some read speech is as shown in Table 3.
Table 3. Identification accuracy in percentage for proposed feature set, MFCC and combined system.

IDENTIFICATION ACCURACY (IN %)
             PROPOSED FEATURES        |  MFCC                    |  COMBINED SYSTEM
             V   N   P   F   A   S    |  V   N   P   F   A   S   |  V   N   P   F   A   S
TEST 1   V  53  16   9   1  21   0    | 51   8  11   7  23   0   | 57   9   8   6  20   0
         N  20  64  12   0   4   0    |  9  37  25   5  23   1   | 12  40  22   4  22   0
         P  16  22  47   3   4   8    |  8   8  39  22  22   1   | 10  11  38  20  19   2
         F   0   0  16  78   1   5    |  1   0  13  80   6   0   |  0   0   5  95   0   0
         A  22  33  29   0  16   0    | 24  10  19   6  41   0   | 29  13  16   7  35   0
         S   4   2  14   4   0  76    |  0   2  27  20   2  49   |  0   2  22   9   2  65
TEST 2   V  64  17   4   0  14   1    | 65  11   9   3  12   0   | 72   7   5   3  12   1
         N  20  57  17   1   2   3    |  8  65  10  10   7   0   |  9  67   8   9   6   1
         P  16  34  38   2   3   7    | 11   6  43  25  15   0   | 14   6  40  25  14   1
         F   2   4  20  74   0   0    |  0   0  15  83   2   0   |  0   0  11  87   2   0
         A  37  26  25   2  10   0    | 27   7  15  15  36   0   | 32   8  15  15  30   0
         S   4   0  12  17   0  67    |  0   1  13  30   2  54   |  0   0   8  24   1  67
TEST 3   V  61   9   3   3  24   0    | 30  32  10   1  27   0   | 51  21   6   0  22   0
         N  43  35   0   0  22   0    | 11  73  13   0   3   0   | 19  60  16   0   5   0
         P  29   6  32  19   8   6    | 10  25  41   5  19   0   | 11  22  40   9  16   2
         F  22   0  13  65   0   0    |  0  13  39  44   4   0   |  0  13  26  57   4   0
         A  63   8   2   0  27   0    |  6  51   8   0  35   0   | 29  30   6   0  35   0
         S   0   0  25   0   0  75    |  0   3  56   6   9  26   |  0   3  47   6   9  35
Table 3 shows results obtained from three systems: the system with the proposed feature set, the
system using standard MFCC features, and a score-level combination/fusion system. Test 1 and
Test 2 are male speech and Test 3 is female speech; they are three different read speech data
used for testing. Here V, N, P, F, A and S represent Vowel, Nasal, Plosive, Fricative,
Approximant and Silence respectively. Notice that the proposed system maps V, N, P, F and S
almost correctly, but it misclassifies approximants as vowels. The system using MFCC classifies
approximants (A) more accurately. We therefore made a score-level fusion of the proposed and
MFCC classifiers to obtain a combined system, as shown in Figure 9.
Figure 9. Combined system
With the combined system we obtain about 54%, 61% and 46% classification accuracy for TEST 1,
TEST 2 and TEST 3 respectively.
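The score-level fusion can be sketched as below. The per-frame min-max normalisation of the class scores and the equal weighting are assumptions for illustration; the paper does not specify the exact fusion rule.

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w=0.5):
    """Score-level fusion of two frame classifiers.

    scores_a, scores_b: (num_frames, 6) class scores from the
    proposed-feature and MFCC systems. Each row is min-max
    normalised before a weighted sum (w is a hypothetical weight).
    """
    def norm(s):
        lo = s.min(axis=1, keepdims=True)
        hi = s.max(axis=1, keepdims=True)
        return (s - lo) / np.where(hi - lo == 0, 1, hi - lo)
    return w * norm(scores_a) + (1 - w) * norm(scores_b)

rng = np.random.default_rng(1)
a, b = rng.random((50, 6)), rng.random((50, 6))  # stand-in scores
fused = fuse_scores(a, b)
labels = ["VNPFAS"[i] for i in np.argmax(fused, axis=1)]  # one label per frame
```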
3.4. Application of proposed system for syllable structure prediction
To predict the syllable structure from the frame-level broad phoneme classifier output, some
smoothing strategies are employed. For classes with long duration, such as vowels, the label 'V'
is accepted into the syllable structure only if there are more than 5 consecutive frames labelled
'V'; otherwise it is omitted. For short phonemes such as plosives, an occurrence of three or more
consecutive 'P' symbols is accepted as a label in the syllable structure. The syllable structure
prediction for three words as per this strategy is given in Table 4.
Table 4. Syllable structure prediction

Word             Actual syllable sequence     Predicted syllable sequence
//Hasyam//       /S/,/FV/,/FAVN/,/S/          /S/,/FV/,/FAVN/,/S/
//Haritham//     /FV/,/AV/,/PVN/,/S/          /FV/,/AV/,/PVN/,/F/
//Bhakshanam//   /S/,/PV/,/PFV/,/NVN/,/S/     /S/,/PV/,/PFV/,/VNP/,/S/
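The smoothing rule can be sketched as follows. The minimum run lengths for vowels (more than 5 frames) and plosives (3 frames) come from the text; the threshold used for the remaining classes is an assumption.

```python
from itertools import groupby

# Minimum run lengths (in frames) for accepting a label into the
# syllable structure: vowels need more than 5 consecutive frames,
# plosives only 3. The default for other classes is an assumption.
MIN_RUN = {"V": 6, "P": 3}
DEFAULT_MIN = 3

def syllable_structure(frame_labels):
    """Collapse frame-level labels into a broad syllable structure,
    dropping runs shorter than the class-specific minimum and
    merging adjacent identical symbols."""
    out = []
    for label, run in groupby(frame_labels):
        n = len(list(run))
        if n >= MIN_RUN.get(label, DEFAULT_MIN) and (not out or out[-1] != label):
            out.append(label)
    return "".join(out)

# e.g. part of the classifier output for //bhakshanam//
print(syllable_structure("SSSSSSSSSSSSPPPPVVVVVVVPPPPPFFFFFFFFFFFFFPP"))  # → SPVPF
```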
4. SUMMARY AND SCOPE OF FUTURE WORK
In this work, broad phoneme classification is attempted using signal-level features. For broad
phoneme classification, features such as voicing, nasality, frication and vowel formants are
studied. From this study, we have arrived at a set of features capable of performing broad
classification of phonemes. A classifier is developed using a feedforward neural network in
order to automatically label each frame in terms of broad phoneme classes. The effectiveness of
the proposed feature set is evaluated on a speech database and compared with standard MFCC
features. The complementary nature of the proposed features and MFCC is demonstrated by
combining the scores of the two systems to gain an improvement in classification accuracy. The
output of this classifier is used for broad syllable structure prediction.
The study can be further extended from broad phoneme classification to full phoneme
classification with an extended feature set. There is also scope for finer syllable structure
prediction, which has applications in the areas of automatic speech recognition and language
recognition.
ACKNOWLEDGEMENTS
The authors would like to thank the Department of Electronics and Information Technology,
Government of India, for providing financial assistance, and Prof. B. Yegnanarayana, Professor,
International Institute of Information Technology, Hyderabad, for the motivation to carry out the
work discussed in this paper.
REFERENCES
[1] Douglas O'Shaughnessy, (2000) Speech Communications: Human and Machine, IEEE Press, New
York, 2nd Edition.
[2] L. Rabiner & B. H. Juang, (1993) Fundamentals of Speech Recognition, Prentice Hall.
[3] Sadaoki Furui, "50 Years of Progress in Speech and Speaker Recognition Research", ECTI
Transactions on Computer and Information Technology, Vol. 1, No. 2, Nov 2005, pp. 64-74.
[4] Carol Y. Espy-Wilson, (1986) "A Phonetically Based Semivowel Recognition System", ICASSP 86,
Tokyo.
[5] Carol Y. Espy-Wilson, "A feature-based semivowel recognition system", J. Acoustical Society of
America, Vol. 96, No. 1, July 1994, pp. 65-72.
[6] Jyh-Shing Roger Jang, Chuen-Tsai Sun, Eiji Mizutani, (1997) Neuro-Fuzzy and Soft Computing,
Prentice Hall, 1st Edition.
[7] T. Pruthi, C. Y. Espy-Wilson, "Acoustic parameters for automatic detection of nasal manner",
Speech Communication, Vol. 43 (2004), Elsevier, pp. 225-239.
[8] Fu Guojiang, "A Novel Isolated Speech Recognition Method based on Neural Network", 2nd
International Conference on Networking and Information Technology, IPCSIT, Vol. 17 (2011),
IACSIT Press, Singapore, pp. 264-269.
[9] Caltenco F., Gevaert W., Tseno G., Mladenov, "Neural Networks used for Speech Recognition",
Journal of Automatic Control, University of Belgrade, Vol. 20, 2010.
[10] Sakshat Virtual Labs, Department of Electronics and Electrical Engineering, IIIT Guwahati.
[11] Speech production mechanism (Tutorial): Speech Signal Processing: Computer Science
Engineering: IIIT Hyderabad Virtual Lab.
[12] Wavesurfer User Manual, http://www.speech.kth.se/wavesurfer/man18.html, pp. 1-6, accessed
9/23/2013.
Authors
Deekshitha G. graduated from Cochin University of Science and Technology in
Electronics and Communication Engineering in 2012. She is currently pursuing a master's
degree in Advanced Communication and Information Systems at Rajiv Gandhi Institute of
Technology, Kottayam, Kerala, India. Her areas of interest are image processing and speech
processing.
Leena Mary received her Bachelor’s degree from Mangalore University in 1988. She
obtained her MTech from Kerala University and Ph.D. from Indian Institute of Technology,
Madras, India. She has 23 years of teaching experience. Currently she is working as
Professor in Electronics and Communication Engineering at Rajiv Gandhi Institute of
Technology, Kottayam, Kerala, India. Her research interests are speech processing, speaker
forensics, signal processing and neural networks. She has published several research papers
and a book, Extraction and Representation of Prosody for Speaker, Speech and Language
Recognition (Springer). She is a member of IEEE and a life member of the Indian Society for
Technical Education.