This document discusses a study investigating the combined use of Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) features in automatic speech recognition systems. It begins by outlining the challenges of automatic speech recognition and then describes the MFCC and LPC algorithms for extracting basic speech features. The study suggests combining MFCC and LPC-based recognition subsystems to improve reliability. Neural networks are used for training and recognition, and results show the combined approach improves recognition quality compared to individual methods.
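The study combines the outputs of an MFCC-based and an LPC-based recognition subsystem. The paper does not give the exact fusion rule, so the sketch below shows one common, simple option: weighted score-level fusion over a shared vocabulary (all names and weights here are illustrative, not from the paper).

```python
# Hypothetical late-fusion sketch: combine per-word scores from an
# MFCC-based and an LPC-based recognizer by weighted averaging.
def fuse_scores(mfcc_scores, lpc_scores, w_mfcc=0.5):
    """Average two {word: score} dicts; higher score = better match."""
    words = set(mfcc_scores) | set(lpc_scores)
    return {w: w_mfcc * mfcc_scores.get(w, 0.0)
               + (1 - w_mfcc) * lpc_scores.get(w, 0.0)
            for w in words}

def recognize(mfcc_scores, lpc_scores):
    """Pick the word with the best combined evidence from both subsystems."""
    fused = fuse_scores(mfcc_scores, lpc_scores)
    return max(fused, key=fused.get)

# Example: fusion resolves the decision using both subsystems' scores.
word = recognize({"yes": 0.6, "no": 0.4}, {"yes": 0.7, "no": 0.3})
```

A trained neural network, as in the study, would replace the fixed weight with learned combination parameters.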
Voiced Speech Characterisation Based On Empirical Mode Decomposition (optljjournal)
Empirical Mode Decomposition (EMD) is a tool for the analysis of multi-component signals. The EMD algorithm adaptively decomposes a given signal into oscillation modes, namely the intrinsic mode functions (IMFs), extracted from the signal itself. Unlike conventional analysis methods (e.g. the Fourier transform and the wavelet transform), the method needs no basis function fixed a priori. In this paper, the EMD algorithm is proposed as an alternative way to estimate the formants characterizing the vocal tract. The proposed method was tested on natural speech: LPC analysis of the first three IMFs is computed using the autocorrelation method and, for the vowels studied, compared with LPC analysis of the speech signal itself.
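The sifting process at the heart of EMD can be illustrated with a deliberately simplified sketch: find the local extrema, build upper and lower envelopes, and subtract their mean. Real EMD uses cubic-spline envelopes and iterates until a stopping criterion holds; linear interpolation is used here only to keep the example short.

```python
import numpy as np

def sift_once(x):
    """One simplified EMD sifting step: subtract the mean of the upper and
    lower envelopes (linear interpolation; real EMD uses cubic splines and
    repeats the sift until a stopping criterion is met)."""
    n = np.arange(len(x))
    # interior local maxima / minima
    maxi = [i for i in range(1, len(x) - 1) if x[i] >= x[i-1] and x[i] >= x[i+1]]
    mini = [i for i in range(1, len(x) - 1) if x[i] <= x[i-1] and x[i] <= x[i+1]]
    # pin the envelopes at the endpoints to limit edge effects
    maxi = [0] + maxi + [len(x) - 1]
    mini = [0] + mini + [len(x) - 1]
    upper = np.interp(n, maxi, x[maxi])
    lower = np.interp(n, mini, x[mini])
    return x - (upper + lower) / 2.0

# A fast 50 Hz tone riding on a slow 3 Hz trend: the first sift
# isolates (approximately) the fast oscillation mode.
t = np.linspace(0, 1, 500)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 3 * t)
imf1 = sift_once(x)
```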
An introduction to the Transformers architecture and BERT (Suman Debnath)
The transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, used mostly for natural language processing (NLP) tasks. Since its advent, it has replaced RNNs and LSTMs for many tasks. The transformer was a major breakthrough in NLP and paved the way for revolutionary architectures such as BERT.
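The operation that distinguishes the transformer from RNNs and LSTMs is scaled dot-product attention, which every position applies over all other positions at once. A minimal NumPy sketch (shapes chosen arbitrarily for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
out, w = scaled_dot_product_attention(Q, K, V)
```

BERT stacks many such attention layers (with multiple heads and feed-forward sublayers) in its encoder.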
A new parallel bat algorithm for musical note recognition (IJECEIAES)
Music is a universal language that needs no interpreter: feelings and sensitivities are shared regardless of people and language. The proposed system consists of two main stages. The first extracts the important properties using linear discriminant analysis (LDA), after a preprocessing step that removes the staff lines using various procedures. The second stage performs recognition using the bat algorithm, a metaheuristic, modified here to obtain better discrimination. A parallel implementation, the developed bat algorithm (DBA), increased execution speed significantly. The method was applied to 1250 different images of musical notes. The system was implemented in MATLAB R2016a on a Windows 10 computer (Intel Core i5-7200U CPU @ 2.50 GHz).
BERT - Part 1 Learning Notes (Senthil Kumar M)
In this Part 1 presentation, I attempt to provide a '30,000-feet view' of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language model in NLP, with high-level technical explanations, collating useful information about BERT from various sources.
This Part 2 presentation is a more in-depth view of BERT; the source links offer more depth than the brief overview in the slides.
Chaotic signals denoising using empirical mode decomposition inspired by mult... (IJECEIAES)
Empirical mode decomposition (EMD) is an effective method for reducing additive noise in noisy chaotic signals. In this paper, the intrinsic mode functions (IMFs) generated by EMD are thresholded using multivariate denoising, an algorithm that combines the wavelet transform with principal component analysis to denoise multivariate signals adaptively. The proposed method is compared with different techniques at various signal-to-noise ratios (SNRs) and with different types of noise. The scale-dependent Lyapunov exponent (SDLE) is also used to compare the behavior of the denoised chaotic signal with the clean signal. The results show that the EMD-MD method achieves the best root mean square error (RMSE) and signal-to-noise ratio gain (SNRG) compared with conventional methods.
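The multivariate (wavelet + PCA) thresholding used in the paper is involved, but the basic pattern of EMD denoising is easy to show: shrink each IMF's samples toward zero, then sum the thresholded IMFs to reconstruct. A minimal sketch with soft thresholding (threshold values here are arbitrary, not the paper's data-driven ones):

```python
def soft_threshold(samples, t):
    """Shrink each sample toward zero by t; values below t become 0."""
    return [max(abs(v) - t, 0.0) * (1 if v > 0 else -1) for v in samples]

def denoise(imfs, thresholds):
    """Soft-threshold each IMF, then sum the IMFs sample-by-sample
    to reconstruct the denoised signal."""
    kept = [soft_threshold(imf, t) for imf, t in zip(imfs, thresholds)]
    return [sum(col) for col in zip(*kept)]

# Two toy 3-sample IMFs: small-amplitude samples (likely noise) vanish,
# large-amplitude ones are shrunk by the threshold.
imfs = [[0.05, -0.6, 1.2], [0.9, 0.02, -0.03]]
clean = denoise(imfs, [0.1, 0.1])
```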
Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound (TELKOMNIKA JOURNAL)
This paper proposes combining the Wavelet Transform (WT) and Euclidean Distance (ED) to estimate the expected feature vector of Indonesian syllables. The research aims to find the most effective and efficient way to extract features from each syllable sound for use in speech recognition systems. The proposed approach, which builds on a previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second, the signal is transformed into the frequency domain using the WT. In the third, the ED algorithm is used to estimate the expected feature vector. The result is a list of features for each syllable that can be used in further research, together with recommendations on the most effective and efficient WT for syllable sound recognition.
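The WT-plus-ED pipeline above can be sketched with the simplest wavelet, the Haar transform, and nearest-template matching by Euclidean distance. The syllable names and template values below are invented for illustration; the paper evaluates several wavelet families, not just Haar.

```python
import math

def haar_step(x):
    """One level of the Haar wavelet transform: (approximation, detail)."""
    approx = [(a + b) / math.sqrt(2) for a, b in zip(x[0::2], x[1::2])]
    detail = [(a - b) / math.sqrt(2) for a, b in zip(x[0::2], x[1::2])]
    return approx, detail

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Feature = approximation coefficients; match by smallest distance.
templates = {"ba": [4.0, 4.0, 2.0, 2.0], "da": [1.0, 1.0, 5.0, 5.0]}
probe = [3.8, 4.1, 2.2, 1.9]
feat = haar_step(probe)[0]
best = min(templates, key=lambda s: euclidean(feat, haar_step(templates[s])[0]))
```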
Fuzzy Logic and Neuro-fuzzy Systems: A Systematic Introduction (Waqas Tariq)
Fuzzy logic is a rigorous mathematical field, and it provides an effective vehicle for modeling the uncertainty in human reasoning. In fuzzy logic, the knowledge of experts is modeled by linguistic rules represented in the form of IF-THEN logic. Like neural network models such as the multilayer perceptron (MLP) and the radial basis function network (RBFN), some fuzzy inference systems (FISs) have the capability of universal approximation. Fuzzy logic can be used in most areas where neural networks are applicable. In this paper, we first give an introduction to fuzzy sets and logic. We then make a comparison between FISs and some neural network models. Rule extraction from trained neural networks or numerical data is then described. We finally introduce the synergy of neural and fuzzy systems, and describe some neuro-fuzzy models as well. Some circuit implementations of neuro-fuzzy systems are also introduced. Examples are given to illustrate the concepts of neuro-fuzzy systems.
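The IF-THEN rule style described above can be made concrete with a tiny fuzzy inference sketch: triangular membership functions and weighted-average defuzzification (a zero-order Sugeno-style scheme). The fan-speed rules and all numbers are invented for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_fan_speed(temp):
    """Two IF-THEN rules, defuzzified by a weighted average of rule outputs:
       IF temp is cool THEN speed = 20; IF temp is hot THEN speed = 80."""
    mu_cool = tri(temp, 0, 15, 30)
    mu_hot = tri(temp, 20, 35, 50)
    if mu_cool + mu_hot == 0:
        return 0.0
    return (mu_cool * 20 + mu_hot * 80) / (mu_cool + mu_hot)

speed = fuzzy_fan_speed(25)   # partially cool AND partially hot
```

A neuro-fuzzy system, as surveyed in the paper, would learn the membership parameters and rule weights from data instead of fixing them by hand.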
SMATalk: Standard Malay Text to Speech Talk System (CSCJournals)
This paper presents a rule-based text-to-speech (TTS) synthesis system for Standard Malay (SM), namely SMaTTS. The proposed system uses a sinusoidal method and some pre-recorded wave files to generate speech. The use of a phone database significantly decreases the amount of memory used, making the system very light and embeddable. The system comprises two phases. The first, Natural Language Processing (NLP), consists of high-level text analysis, phonetic analysis, text normalization, and a morphophonemic module designed specially for SM to overcome problems in defining rules for the SM orthography before text is passed on to the second phase. The second phase, Digital Signal Processing (DSP), performs the low-level speech waveform generation. An intelligible and adequately natural-sounding formant-based speech synthesizer with a light, user-friendly Graphical User Interface (GUI) is introduced. An SM phoneme set and an inclusive phone database have been constructed carefully for this phone-based synthesizer. Applying generative phonology, comprehensive letter-to-sound (LTS) rules and a pronunciation lexicon were devised for SMaTTS. For evaluation, a Diagnostic Rhyme Test (DRT) word list was compiled, and several experiments were performed to evaluate the quality of the synthesized speech by analyzing the Mean Opinion Score (MOS) obtained. The overall performance of the system, as well as the room for improvement, is thoroughly discussed.
Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking (dhia_naruto)
This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in these fields. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
ANALYSIS OF SPEECH UNDER STRESS USING LINEAR TECHNIQUES AND NON-LINEAR TECHNI... (cscpconf)
Analysis of speech for recognition of stress is important for identifying the emotional state of a person. This can be done using linear techniques, which extract features such as pitch, vocal tract spectrum, formant frequencies, duration, and MFCCs from speech. TEO-CB-Auto-Env is a non-linear method of feature extraction. Analysis is done using the TU-Berlin (Technical University of Berlin) German database. Here, emotion recognition covers emotions such as neutral, happy, disgust, sad, boredom, and anger. Emotion recognition is used in lie detectors, in database access systems, and in the military for identifying soldiers' emotional state during war.
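The full TEO-CB-Auto-Env pipeline (critical-band filtering, autocorrelation envelope) is elaborate, but the non-linear operator at its core, the Teager Energy Operator, is a three-sample formula. For a pure sinusoid it yields a constant proportional to (amplitude x frequency) squared, which is why it is sensitive to the vocal excitation changes that accompany stress.

```python
import math

def teager_energy(x):
    """Teager Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].
    For x[n] = sin(w*n) the identity sin(A-B)*sin(A+B) = sin^2(A) - sin^2(B)
    gives psi[n] = sin(w)^2 exactly, i.e. a constant tracking frequency."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

x = [math.sin(0.3 * n) for n in range(100)]
psi = teager_energy(x)   # constant, equal to sin(0.3)^2
```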
GENDER RECOGNITION SYSTEM USING SPEECH SIGNAL (IJCSEIT Journal)
In this paper, a system developed for speech encoding, analysis, synthesis, and gender identification is presented. A typical gender recognition system can be divided into a front-end and a back-end. The task of the front-end is to extract the gender-related information from a speech signal and represent it by a set of vectors called features. Features like power spectral density and the frequency at maximum power carry speaker information; they are extracted using the Fast Fourier Transform (FFT). The task of the back-end (also called the classifier) is to create a gender model that recognizes the speaker's gender from the speech signal in the recognition phase. This paper also presents the digital processing of speech signals (the pronounced letters "A" and "B") taken from 10 persons, 5 male and 5 female. The power spectrum of the signal is estimated, and the frequency at maximum power of the English phonemes is extracted from it. The system uses a threshold technique as the identification tool. The recognition accuracy of this system is 80% on average.
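The front-end/back-end split described above fits in a few lines: an FFT-based estimate of the frequency at maximum power, then a threshold rule. The 165 Hz threshold and the test tones below are illustrative, not values from the paper.

```python
import numpy as np

def peak_frequency(signal, fs):
    """Front-end: frequency (Hz) at maximum power in the FFT power spectrum."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

def classify_gender(signal, fs, threshold_hz=165.0):
    """Back-end: toy threshold rule (threshold value is illustrative):
    lower peak frequency -> 'male', higher -> 'female'."""
    return "male" if peak_frequency(signal, fs) < threshold_hz else "female"

fs = 8000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 120 * t)    # 120 Hz fundamental
high = np.sin(2 * np.pi * 220 * t)   # 220 Hz fundamental
```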
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words with non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the phonetic distance between pronunciations, and a critical-distance threshold for the classification. The proposed method consists of two steps: a training step that estimates the critical-distance parameter from transcribed data, and a second step that uses this critical-distance criterion to classify input utterances into pronunciation variants and OOV words. The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. The confusion matrix and the precision, recall, and accuracy metrics are used for performance evaluation. Experimental results show significant performance improvement over existing classifiers.
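The classification rule can be sketched with plain Levenshtein distance over phone sequences standing in for DPW (DPW additionally uses phone-similarity-based substitution costs, and the real critical distance is learned from data; the lexicon and threshold below are toy values).

```python
def phone_edit_distance(p, q):
    """Levenshtein distance between two phone sequences (unit costs)."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def classify(utterance, lexicon, critical_distance):
    """Within the critical distance of a known word -> pronunciation variant
    of that word; otherwise treat the utterance as OOV."""
    word, d = min(((w, phone_edit_distance(utterance, pron))
                   for w, pron in lexicon.items()), key=lambda x: x[1])
    return ("variant", word) if d <= critical_distance else ("oov", None)

lexicon = {"tomato": ["t", "ah", "m", "ey", "t", "ow"]}
label = classify(["t", "ah", "m", "aa", "t", "ow"], lexicon, 2)
```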
Speech Analysis and synthesis using Vocoder (IJTET Journal)
Abstract: In this paper, speech analysis and synthesis using a vocoder is proposed. Voice conversion systems do not create new speech signals; they transform existing ones. The proposed speech vocoding differs from speech coding: the speech signal is analyzed and represented with fewer bits so that bandwidth efficiency can be increased, and the speech signal is then synthesized from the received bits of information. Three aspects of the analysis are discussed: pitch refinement, spectral envelope estimation, and maximum voiced frequency estimation. A quasi-harmonic analysis model can be used to implement a pitch refinement algorithm that improves the accuracy of the spectral estimation, and a harmonic-plus-noise model reconstructs the speech signal from the parameters. The goal is the highest possible resynthesis quality using the lowest possible number of bits to transmit the speech signal. Future work aims at incorporating phase information into the analysis and modeling process, and at synthesizing these three aspects over different pitch periods.
Audio/Speech Signal Analysis for Depression (ijsrd.com)
The word "depressed" is a common everyday word. People might say "I am depressed" when in fact they mean "I am fed up because I have had a row, or failed an exam, or lost my job". These ups and downs of life are common and normal, and most people recover quite quickly. Depression can be identified by different methods; here it is identified with the MFCC (Mel Frequency Cepstral Coefficient) method. Different parameters can be used to distinguish depressed speech from normal speech, but MFCC-based parameters are the most applicable, because a depressive speech or audio signal can contain more information in the higher energy bands compared with normal speech.
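MFCC analysis rests on the mel scale, a perceptual frequency warping that allocates more resolution to low frequencies. A minimal sketch of the standard Hz/mel mapping and the band edges of the triangular filters whose log energies feed the final DCT (band count and range below are arbitrary examples):

```python
import math

def hz_to_mel(f):
    """Common mel-scale mapping used in MFCC front ends."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_low, f_high, n_bands):
    """Band edges equally spaced on the mel scale, converted back to Hz;
    these define the triangular mel filters."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    mels = [lo + i * (hi - lo) / (n_bands + 1) for i in range(n_bands + 2)]
    return [mel_to_hz(m) for m in mels]

edges = mel_band_edges(0.0, 4000.0, 10)   # bands widen as frequency rises
```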
Isolated word recognition using LPC & vector quantization (eSAT Journals)
Abstract: Speech recognition is always looked upon as a fascinating field in human-computer interaction, and is one of the fundamental steps towards understanding human recognition and behavior. This paper explicates the theory and implementation of speech recognition through a speaker-dependent, real-time isolated-word recognizer. The main approach is to first obtain feature vectors using LPC, followed by vector quantization; the quantized vectors are then recognized by finding the minimum average distortion. All speech recognition systems contain two main phases, a training phase and a testing phase. In the training phase, the features of the words are extracted and stored as templates in the database; during the recognition phase, feature matching takes place: the extracted features are compared with the templates in the database. Features are extracted using LPC analysis, and vector quantization is used to generate the codebooks. Finally, the recognition decision is made based on the matching score. MATLAB is used to implement this concept. Index terms: speech recognition, LPC, vector quantization, codebook.
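The minimum-average-distortion decision rule described above is straightforward to sketch: for each word's codebook, average over the utterance's frames the distance to the nearest codeword, then pick the word with the smallest average. The 2-D "codewords" below are toy values; real systems use LPC-derived vectors of order 10 or more.

```python
def distortion(frame, code):
    """Squared Euclidean distance between a feature frame and a codeword."""
    return sum((a - b) ** 2 for a, b in zip(frame, code))

def avg_distortion(frames, codebook):
    """Average, over frames, of the distance to the nearest codeword."""
    return sum(min(distortion(f, c) for c in codebook)
               for f in frames) / len(frames)

def recognize(frames, codebooks):
    """Pick the word whose codebook gives minimum average distortion."""
    return min(codebooks, key=lambda w: avg_distortion(frames, codebooks[w]))

codebooks = {
    "one": [[0.0, 1.0], [0.2, 0.8]],   # toy 2-D codewords per word
    "two": [[1.0, 0.0], [0.9, 0.1]],
}
word = recognize([[0.1, 0.9], [0.15, 0.85]], codebooks)
```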
Arabic Phoneme Recognition using Hierarchical Neural Fuzzy Petri Net and LPC ... (CSCJournals)
The basic idea behind the proposed hierarchical phoneme recognition is that phonemes can be classified into specific phoneme types organized within a hierarchical tree structure. The recognition principle is "divide and conquer": a large problem is divided into many smaller, easier-to-solve problems whose solutions can be combined to yield a solution to the complex problem. A fuzzy Petri net (FPN) is a powerful modeling tool for knowledge systems based on fuzzy production rules. To build a hierarchical classifier using a Neural Fuzzy Petri Net (NFPN), each node of the hierarchical tree is represented by an NFPN. Every NFPN in the tree is trained by repeatedly presenting a set of input patterns along with the class to which each pattern belongs. The feature vector used as input to the NFPN consists of the LPC parameters.
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG... (sipij)
Previous research has found the autocorrelation domain to be appropriate for signal and noise separation. This paper discusses a simple and effective method for decreasing the effect of noise on the autocorrelation of the clean signal, which can later be used to extract mel-cepstral parameters for speech recognition. Two methods are proposed to deal with the error introduced by treating speech and noise as completely uncorrelated. The basic approach reduces the effect of noise by estimating its contribution and subtracting it from the autocorrelation of the noisy speech signal. To improve this method, we insert a speech/noise cross-correlation term into the equations used to estimate the clean-speech autocorrelation, using an estimate of it found through a kernel method; alternatively, we estimate the cross-correlation term with an averaging approach. A further improvement is obtained by introducing an overestimation parameter into the basic method. We tested the proposed methods on the Aurora 2 task. The basic method shows considerable improvement over the standard features and some other robust autocorrelation-based features, and the proposed techniques further increase the robustness of the basic autocorrelation-based method.
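The basic method in the abstract can be sketched directly: if speech and noise are assumed uncorrelated, the autocorrelation of the noisy signal is approximately the sum of the speech and noise autocorrelations, so an estimate of the noise autocorrelation (averaged over noise-only frames) can be subtracted. The signals below are synthetic; the paper's cross-correlation and overestimation refinements are omitted.

```python
import math

def autocorr(x, max_lag):
    """Biased autocorrelation estimate for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]

def clean_autocorr(noisy, noise_frames, max_lag):
    """Basic method: r_clean ~= r_noisy - r_noise, with r_noise estimated
    by averaging autocorrelations of noise-only frames."""
    r_noisy = autocorr(noisy, max_lag)
    noise_rs = [autocorr(f, max_lag) for f in noise_frames]
    r_noise = [sum(col) / len(noise_rs) for col in zip(*noise_rs)]
    return [a - b for a, b in zip(r_noisy, r_noise)]

# Synthetic example: sinusoidal "speech" plus a deterministic alternating
# "noise"; subtracting the noise autocorrelation restores the speech's
# lag-0 value (~0.5 for a unit sinusoid).
s = [math.sin(0.2 * n) for n in range(200)]
noise = [0.5 * (-1) ** n for n in range(200)]
noisy = [a + b for a, b in zip(s, noise)]
r = clean_autocorr(noisy, [noise], 10)
```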
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco... (TELKOMNIKA JOURNAL)
Sundanese is one of the popular languages in Indonesia, so research on it is essential; that is the motivation for this study. The vital parts for achieving high recognition accuracy are feature extraction and the classifier, and the main goal of this study was to analyze the first. Three types of feature extraction were tested: Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC). The results of the three feature extractors became the input of the classifier; the study applied Hidden Markov Models. Before classification, however, vector quantization based on clustering is needed; each result was compared across the number of clusters and hidden states used. The dataset came from four people who each spoke the digits from zero to nine 60 times. Finally, all feature extractors produced the same performance on the corpus used.
Realization and design of a pilot assist decision making system based on spee...csandit
A system based on speech recognition is proposed fo
r pilot assist decision-making. It is based
on a HIL aircraft simulation platform and uses the
microcontroller SPCE061A as the central
processor to achieve better reliability and higher
cost-effect performance. Technologies of
LPCC (linear predictive cepstral coding) and DTW (D
ynamic Time Warping) are applied for
isolated-word speech recognition to gain a smaller
amount of calculation and a better real-time
performance. Besides, we adopt the PWM (Pulse Width
Modulation) regulation technology to
effectively regulate each control surface by speech
, and thus to assist the pilot to make decisions.
By trial and error, it is proved that we have a sat
isfactory accuracy rate of speech recognition
and control effect. More importantly, our paper pro
vides a creative idea for intelligent human-
computer interaction and applications of speech rec
ognition in the field of aviation control. Our
system is also very easy to be extended and applied
We present a causal speech enhancement model working on the
raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with
skip-connections. It is optimized on both time and frequency
domains, using multiple loss functions. Empirical evidence
shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises,
as well as room reverb. Additionally, we suggest a set of
data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard
benchmarks, both using objective metrics and human judgements. The proposed model matches state-of-the-art performance of both causal and non causal methods while working
directly on the raw waveform.
Index Terms: Speech enhancement, speech denoising, neural
networks, raw waveform
VOICED SPEECH CHARACTERISATION BASED ON EMPIRICAL MODE DECOMPOSITIONoptljjournal
Empirical Mode Decomposition (EMD) is a tool for the analysis of multi-component signals. The EMD algorithm decomposes adaptively a given oscillation modes namely the functions of intrinsic mode (IMFs) extracted from the signal itself signal. The analysis method is no need for a basic function fixed a priori as conventional analytical methods (eg Fourier transform and the wavelet transform). In this paper, the algorithm of empirical mode decomposition (EMD) is proposed as an alternative to estimate the vocal tract formants characterizing the vocal tract. The proposed method was tested on natural speech. LPC analysis of the first three functions intrinsic modes using the autocorrelation is calculated; a comparison was made between the LPC analysis of the first three vowel of MFIs studied and the LPC analysis of the speech signal.
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:2 No:7, 2008
Investigation of Combined Use of MFCC and LPC Features in Speech Recognition Systems
International Science Index 19, 2008 waset.org/publications/7293
K.R. Aida-Zade, C. Ardil and S.S. Rustamov
Abstract—The paper states the automatic speech recognition problem and describes the purpose of speech recognition and its application fields. Taking Azerbaijani speech as the subject, the principles of constructing a speech recognition system and the problems arising in such a system are investigated.

The algorithms for computing speech features, which form the main part of a speech recognition system, are analyzed. In particular, algorithms for determining the Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) coefficients, which express the basic speech features, are developed. The combined use of MFCC and LPC cepstra is suggested to improve the reliability of the speech recognition system. To this end, the recognition system is divided into MFCC-based and LPC-based recognition subsystems. The training and recognition processes are carried out in both subsystems separately, and the system accepts a decision only when the results of the two subsystems coincide. This decreases the error rate during recognition.

The training and recognition processes are realized by artificial neural networks, trained by the conjugate gradient method. The paper investigates the problems caused by the number of speech features when training the neural networks of the MFCC-based and LPC-based recognition subsystems.

The variability of the results of neural networks trained from different initial points is analyzed. A methodology for the combined use of neural networks trained from different initial points is suggested to improve the reliability of the recognition system and increase the recognition quality, and practical results are presented.
Keywords—speech recognition, cepstral analysis, voice activation detection algorithm, Mel Frequency Cepstral Coefficients, speech features, Cepstral Mean Subtraction, neural networks, Linear Predictive Coding
I. INTRODUCTION

Recently, as a result of the wide development of computers, various forms of information exchange between man and computer have appeared. At present, inputting data into the computer by speech and its recognition by the computer is one of the actively developing scientific fields. Because each language has its own specific features, different speech recognition systems are investigated for different languages.
Kamil Aida-Zade is with the Institute of Cybernetics of the National Academy of Sciences, Baku, Azerbaijan.
Cemal Ardil is with the National Academy of Aviation, Baku, Azerbaijan.
Samir Rustamov is with the Institute of Cybernetics of the National Academy of Sciences, Baku, Azerbaijan.
This is why we propose a speech recognition system for the Azerbaijani language.

This paper deals with the construction of a structured Azerbaijani speech recognition system, the analysis carried out while investigating the system, and the recognition results. The speech input to our system consists of a finite number of words, clearly pronounced with definite time intervals between them. The recognizable words, depending on the applied field, can be used for various purposes.
II. PROBLEM STATEMENT

Automatic speech recognition by computer is a process where speech signals are automatically converted into the corresponding sequence of words in text.

Automatic speech recognition involves a number of disciplines such as physiology, acoustics, signal processing, pattern recognition, and linguistics. The difficulty of automatic speech recognition stems from many aspects of these areas.
Variability from speakers: A word may be uttered differently by the same speaker because of illness or emotion, and it may be articulated differently depending on whether it is planned read speech or spontaneous conversation. Speech produced in noise differs from speech produced in a quiet environment, because the speaker changes speech production in an effort to communicate more effectively across a noisy environment. Since no two persons share identical vocal cords and vocal tracts, they cannot produce the same acoustic signal. Typically, females sound different from males, as do children from adults. There is also variability due to dialect and foreign accent.
Variability from environments: The acoustical environment in which recognizers are used introduces another layer of corruption in speech signals, because of background noise, reverberation, microphones, and transmission channels.
III. THE METHODS OF SOLUTION

First, the speech signal is transformed into an electric oscillation by a sound recorder (for example, a microphone). The signal is then passed through an analog-digital converter and transformed into digital form at some sampling frequency f_d and quantization level. The sampling frequency determines the rate needed to sample the analog signal without losing its important information.

The main part of a speech recognition system consists of training and recognition processes. Initially, the basic features
characterizing the speech signal are computed in both processes. The efficiency of this stage is one of the significant factors affecting the behavior of the subsequent stages and the exactness of speech recognition. Using the time function of the signal directly as a feature is ineffective: when the same person says the same word, its time function varies significantly.

At present, methods of calculating MFCC (Mel Frequency Cepstral Coefficients) and LPC (Linear Predictive Coding) coefficients are widely used in speech recognition to obtain speech features.
Let us explain the essence of these methods separately.

The model of speech generation consists of two parts: the generation of the excitation signal and the vocal tract filter. The excitation signal is spectrally shaped by a vocal tract equivalent filter; the outcome of this process is the speech. If e(n) denotes the excitation signal and θ(n) denotes the impulse response of the vocal tract equivalent filter, the speech signal equals the excitation signal convolved with the impulse response of the vocal tract filter, as shown in equation (1):

s(n) = e(n) * θ(n)    (1)

A convolution in the time domain corresponds to a multiplication in the frequency domain:

S(ω) = E(ω) · θ(ω)    (2)
In the MFCC method, taking the logarithm of equation (2) makes the multiplied spectra additive:

log S(ω) = log[E(ω) · θ(ω)] = log E(ω) + log θ(ω).

It is possible to separate the excitation spectrum E(ω) from the vocal system spectrum θ(ω) by remembering that E(ω) is responsible for the "fast" spectral variations, while θ(ω) is responsible for the "slow" spectral variations. Frequency components corresponding to E(ω) appear at large values on the horizontal axis of the new "frequency" domain, whereas frequency components corresponding to θ(ω) appear at small values. The domain obtained after taking the logarithm and the inverse Fourier transform is called the cepstrum domain, and the word "quefrency" is used for describing the "frequencies" in the cepstrum domain.
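The additivity obtained from equations (1) and (2) can be checked numerically. The sketch below uses synthetic toy signals (the excitation sequence and impulse response are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Illustrative check of equations (1)-(2): convolving an excitation
# sequence with a vocal-tract impulse response multiplies their spectra,
# so the log-magnitude spectra add. Both signals here are synthetic toys.
rng = np.random.default_rng(0)
e = rng.standard_normal(64)          # toy excitation signal e(n)
theta = np.array([1.0, -0.9, 0.4])   # toy vocal-tract impulse response

s = np.convolve(e, theta)            # s(n) = e(n) * theta(n), eq. (1)
N = len(s)                           # FFT length covering the full convolution

S = np.fft.rfft(s, N)
E = np.fft.rfft(e, N)
Theta = np.fft.rfft(theta, N)

# eq. (2): S = E * Theta, hence log|S| = log|E| + log|Theta|
assert np.allclose(S, E * Theta)
assert np.allclose(np.log(np.abs(S)),
                   np.log(np.abs(E)) + np.log(np.abs(Theta)))
```

The zero-padded FFT of the full linear convolution makes the multiplicative relation exact, which is why the assertions hold to floating-point precision.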
In the same way, in the LPC method the z-transform is applied to the convolution in the time domain:

S(z) = E(z) · θ(z).
The main idea behind linear prediction is to extract the vocal tract parameters. Given the speech samples at time n, s(n) can be modeled as a linear combination of the past p speech samples, such that:

ŝ(a; n) = Σ_{k=1}^{p} s(n − k) · a_p(k),

where a_p = (a_p(1), a_p(2), ..., a_p(p)) are the unknown LPC coefficients (p ∈ [8, 12]).

Summing the real and predicted samples, we get the following error signal:

e(a; n) = s(n) + ŝ(a; n) = s(n) + Σ_{k=1}^{p} a_p(k) · s(n − k).    (3)

Applying the z-transform to this signal gives R(z) = S(z) A(z). The filter

A(z) = 1 + Σ_{k=1}^{p} a_p(k) z^{−k}

is called the prediction error filter. This filter is equal to the inverse of the vocal tract equivalent filter:

A(z) = 1 / θ(z).

To find the vocal tract filter θ(z), we must first find the LPC coefficients a_p. To this end, the following functional is minimized:

ε_p(a) = Σ_{n=1}^{M} e²(a; n) → min,    (4)

where M is the number of frames. We use the necessary condition of the minimum to solve the problem:

∂ε_p(a)/∂a_p(k) = ∂/∂a_p(k) Σ_{n=1}^{M} e²(a; n) = 2 Σ_{n=1}^{M} e(a; n) ∂/∂a_p(k) [ s(n) + Σ_{l=1}^{p} a_p(l) s(n − l) ] = 2 Σ_{n=1}^{M} e(a; n) s(n − k) = 0,  k = 1, 2, ..., p.

Then we get

Σ_{n=1}^{M} [ s(n) + Σ_{l=1}^{p} a_p(l) s(n − l) ] s(n − k) = 0.    (5)

Let us denote r_x(k) = Σ_{n=1}^{M} s(n) s(n − k). Then equation (5) can be written in the following form:

r_x(k) + Σ_{l=1}^{p} a_p(l) r_x(l − k) = 0,  or  Σ_{l=1}^{p} a_p(l) r_x(k − l) = − r_x(k),  k = 1, ..., p.    (6)

Equation (6) is called the normal equation or the Yule-Walker equation.

Using expression (3) in the functional (4), we get:

ε_p(a) = Σ_{n=1}^{M} e²(a; n) = Σ_{n=1}^{M} e(a; n) e(a; n) = Σ_{n=1}^{M} e(a; n) [ s(n) + Σ_{k=1}^{p} a_p(k) s(n − k) ] = Σ_{n=1}^{M} e(a; n) s(n) + Σ_{k=1}^{p} a_p(k) Σ_{n=1}^{M} e(a; n) s(n − k).

Since Σ_{n=1}^{M} e(a; n) s(n − k) = 0, the functional (4) takes the following form:

ε_{p,min}(a) = Σ_{n=1}^{M} e(a; n) s(n) = Σ_{n=1}^{M} [ s(n) + Σ_{k=1}^{p} a_p(k) s(n − k) ] s(n) = r_x(0) + Σ_{k=1}^{p} a_p(k) r_x(k).

The coefficients a_p(k) giving the minimum of the functional are found by the following Levinson-Durbin recursion:

1. a) a_0(0) = 1;  b) E_0 = r_x(0).
2. For j = 0, 1, ..., p − 1 calculate the following expressions:
   a) γ_j = r_x(j + 1) + Σ_{i=1}^{j} a_j(i) r_x(j − i + 1);
   b) Γ_{j+1} = − γ_j / E_j;
   c) a_{j+1}(i) = a_j(i) + Γ_{j+1} a_j(j − i + 1),  i = 1, 2, ..., j;
   d) a_{j+1}(j + 1) = Γ_{j+1};
   e) E_{j+1} = E_j [ 1 − Γ_{j+1}² ].
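The Levinson-Durbin recursion for solving the Yule-Walker equations (6) can be sketched as a minimal NumPy implementation. The function name and the AR(1) toy check below are our own illustrations, not from the paper:

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for the LPC coefficients a_p(1..p)
    via the Levinson-Durbin recursion.

    r : autocorrelation sequence r_x(0), r_x(1), ..., r_x(p)
    Returns (a, E) with a[0] = 1 and E the final prediction error."""
    a = np.zeros(p + 1)
    a[0] = 1.0                      # step 1a: a_0(0) = 1
    E = r[0]                        # step 1b: E_0 = r_x(0)
    for j in range(p):              # step 2: j = 0, 1, ..., p - 1
        # a) gamma_j = r_x(j+1) + sum_i a_j(i) r_x(j-i+1)
        gamma = r[j + 1] + sum(a[i] * r[j - i + 1] for i in range(1, j + 1))
        G = -gamma / E              # b) reflection coefficient Gamma_{j+1}
        a_new = a.copy()
        for i in range(1, j + 1):   # c) update the inner coefficients
            a_new[i] = a[i] + G * a[j - i + 1]
        a_new[j + 1] = G            # d) new last coefficient
        a = a_new
        E = E * (1.0 - G * G)       # e) update the prediction error
    return a, E

# Toy check on an AR(1) process with known autocorrelation r(k) = rho**k:
rho = 0.9
r = np.array([rho ** k for k in range(3)])
a, E = levinson_durbin(r, p=1)
assert np.isclose(a[1], -rho)       # optimal first-order predictor
assert np.isclose(E, 1 - rho ** 2)  # residual prediction error
```

For an AR(1) signal the recursion recovers the generating coefficient in one step, which the assertions verify.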
IV. ALGORITHM OF CALCULATION OF SPEECH FEATURES

The combined use of LPC and MFCC cepstra in the speech recognition system requires calculating both sets of speech features. The algorithm for calculating the speech features is defined in the following form.
1. Pre-processing. The amplitude spectrum of a speech signal is dominant at low frequencies (up to approximately 4 kHz). The speech signal is passed through a first-order FIR high-pass filter:

s_p(n) = s_in(n) − α · s_in(n − 1),

where α is the filter coefficient (α ∈ (0.95, 1)) and s_in(n) is the input signal.
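This pre-emphasis step can be sketched as follows. The coefficient value 0.97 is an assumption for illustration; the paper only constrains α to the interval (0.95, 1):

```python
import numpy as np

# Minimal sketch of the pre-processing step: first-order FIR high-pass
# (pre-emphasis) filter s_p(n) = s_in(n) - alpha * s_in(n-1).
def pre_emphasis(s_in, alpha=0.97):
    s_p = np.empty_like(s_in, dtype=float)
    s_p[0] = s_in[0]                      # no previous sample for n = 0
    s_p[1:] = s_in[1:] - alpha * s_in[:-1]
    return s_p

x = np.array([1.0, 1.0, 1.0, 1.0])        # a constant (DC) signal...
y = pre_emphasis(x)
# ...is strongly attenuated after the first sample: y[n] = 1 - alpha
assert np.allclose(y[1:], 1 - 0.97)
```

The constant-input check illustrates the high-pass behavior: low-frequency content is suppressed while rapid sample-to-sample changes pass through.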
2. Voice activation detection (VAD). The problem of locating the endpoints of an utterance in a speech signal is a major problem for the speech recognizer; an inaccurate endpoint detection decreases its performance. Some commonly used measurements for finding speech are the short-term energy estimate E_s, the short-term power estimate P_s, and the short-term zero crossing rate Z_s. For the speech signal s_p(n) these measures are calculated as follows:

E_s(m) = Σ_{n=m−L+1}^{m} s_p²(n),

P_s(m) = (1/L) Σ_{n=m−L+1}^{m} s_p²(n),

Z_s(m) = (1/L) Σ_{n=m−L+1}^{m} |sgn(s_p(n)) − sgn(s_p(n − 1))| / 2,

where

sgn(s_p(n)) = 1 if s_p(n) ≥ 0, and −1 if s_p(n) < 0.

These measures are calculated for each block of L = 100 samples. The short-term zero crossing rate gives a measure of how many times the signal s_p(n) changes sign; it tends to be larger during unvoiced regions.

These measures need triggers for deciding where the utterances begin and end. To create a trigger, one needs some information about the background noise. This is obtained by assuming that the first 5 blocks are background noise; under this assumption, the mean and variance of the measures are calculated. To make a more convenient approach, the following function is used:

W_s(m) = P_s(m) · (1 − Z_s(m)) · S_c.

Using this function, both the short-term power and the zero crossing rate are taken into account. S_c is a scale factor for avoiding small values; in a typical application S_c = 1000. The trigger for this function can be described as:

t_W = μ_W + α δ_W,

where μ_W is the mean and δ_W is the variance of W_s(m), calculated over the first 5 blocks. The term α is a constant that has to be fine-tuned according to the characteristics of the signal. After some testing, the following approximation of α gives a reasonably good voice activation detection at various levels of additive background noise:

α = 0.2 · δ_W^{−0.4}.

The voice activation detection function VAD(m) can then be found as:

VAD(m) = 1 if W_s(m) ≥ t_W, and 0 if W_s(m) < t_W.

Using this function we can detect the endpoints of an utterance.

3. Framing. The input signal is divided into overlapping frames of N samples:

s_frame(n) = s_p(n) · w(n),

w(n) = 1 if K·r < n ≤ K·r + N, r = 0, 1, 2, ..., M − 1, and 0 otherwise,

where M is the number of frames, f_s is the sampling frequency, t_frame is the frame length measured in time, and K is the frame step. The frame length is N = f_s · t_frame.

TABLE I
VALUES OF FRAME LENGTH AND FRAME STEP INTERVAL DEPENDING ON THE SAMPLING FREQUENCY

Sampling frequency (f_s) | Frame length (N) | Frame step (K)
f_s = 16 kHz             | 400              | 160
f_s = 11 kHz             | 256              | 110
f_s = 8 kHz              | 200              | 80

We use the f_s = 16 kHz sampling frequency in our system.
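The VAD measures and trigger above can be sketched as follows. This is a simplified illustration with synthetic signals (the block layout and the guard for zero variance are our own assumptions); as the paper notes, the trigger constant α has to be fine-tuned to the signal, and with this formula t_W sits close to the noise mean, so noise-only blocks may still trigger:

```python
import numpy as np

# Simplified sketch of the VAD: block power P_s, zero-crossing rate Z_s,
# the combined measure W_s = P_s * (1 - Z_s) * S_c, and a trigger derived
# from the first 5 blocks (assumed to be background noise).
def vad(s_p, L=100, Sc=1000.0):
    n_blocks = len(s_p) // L
    W = np.empty(n_blocks)
    for m in range(n_blocks):
        block = s_p[m * L:(m + 1) * L]
        P = np.mean(block ** 2)                   # short-term power P_s
        sgn = np.where(block >= 0, 1, -1)
        Z = np.mean(np.abs(np.diff(sgn)) / 2)     # zero-crossing rate Z_s
        W[m] = P * (1 - Z) * Sc                   # combined measure W_s
    mu, var = W[:5].mean(), W[:5].var()           # background statistics
    alpha = 0.2 * var ** -0.4 if var > 0 else 1.0 # guard against var = 0
    t = mu + alpha * var                          # trigger t_W
    return W >= t                                 # VAD decision per block

rng = np.random.default_rng(1)
noise = 0.01 * rng.standard_normal(500)           # 5 noise-only blocks
tone = np.sin(2 * np.pi * 0.01 * np.arange(500))  # 5 loud "speech" blocks
decisions = vad(np.concatenate([noise, tone]))
assert decisions[5:].all()                        # speech blocks detected
```

The loud periodic blocks sit far above the noise-derived trigger and are reliably flagged; separating marginal noise blocks is exactly what the fine-tuning of α addresses.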
4. Windowing. There are a number of different window functions to choose from to minimize the signal discontinuities at the frame edges. One of the most commonly used for windowing a speech signal before the Fourier transform is the Hamming window:

s_w(n) = [0.54 − 0.46 cos(2π(n − 1)/(N − 1))] · s_frame(n),  1 ≤ n ≤ N.
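Framing and windowing together can be sketched as follows, assuming the Table I values for f_s = 16 kHz (N = 400, K = 160); the function name is our own:

```python
import numpy as np

# Minimal sketch of framing and windowing: split the pre-processed signal
# into overlapping frames of N samples with step K, then apply a Hamming
# window to each frame. N = 400, K = 160 follow Table I for f_s = 16 kHz.
def frame_and_window(s_p, N=400, K=160):
    M = 1 + (len(s_p) - N) // K           # number of full frames
    n = np.arange(N)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    frames = np.stack([s_p[r * K : r * K + N] for r in range(M)])
    return frames * hamming               # windowed frames, shape (M, N)

s = np.ones(16000)                        # one second of signal at 16 kHz
frames = frame_and_window(s)
assert frames.shape == ((16000 - 400) // 160 + 1, 400)
# The Hamming window tapers the frame edges down to 0.08 of the peak.
assert np.isclose(frames[0, 0], 0.08)
```

With a 160-sample step and 400-sample frames, consecutive frames overlap by 60%, which smooths the spectral estimates across frame boundaries.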
Calculation of the MFCC features.
Fast Fourier transform (FFT). Applying the FFT to the windowed frames, the spectrum of each frame is calculated:

bin_k = Σ_{n=1}^{N} s_w(n) e^{−i(n−1)k·2π/N},  k = 0, 1, 2, ..., N − 1.
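As a sanity check, the DFT defined above (with s_w indexed from 1) coincides with NumPy's FFT convention; the toy frame below is an illustrative assumption:

```python
import numpy as np

# With n running from 1 to N, bin_k = sum_n s_w(n) e^{-i(n-1)k 2pi/N}
# is exactly the standard DFT, i.e. np.fft.fft(s_w)[k].
rng = np.random.default_rng(2)
N = 16
s_w = rng.standard_normal(N)              # toy windowed frame

k = np.arange(N)
n = np.arange(1, N + 1)
bins = np.array([np.sum(s_w * np.exp(-1j * (n - 1) * kk * 2 * np.pi / N))
                 for kk in k])

assert np.allclose(bins, np.fft.fft(s_w))
```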
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:2 No:7, 2008

Mel filtering. The low-frequency components of the magnitude spectrum are ignored. The useful frequency band lies between 64 Hz and half of the actual sampling frequency. This band is divided into 23 channels, equidistant in the mel frequency domain. Each channel has a triangular-shaped frequency window, and consecutive channels are half-overlapping. The choice of the starting frequency of the filter bank, f_start = 64 Hz, roughly corresponds to the case where the full frequency band is divided into 24 channels and the first channel is discarded, using any of the three possible sampling frequencies.
The centre frequencies of the channels in terms of FFT bin indices (cbin_i for the i-th channel) are calculated as follows:

Mel(x) = 2595 lg(1 + x/700),    Mel^{−1}(mel) = 700 · (10^{mel/2595} − 1),

f_{c_i} = Mel^{−1}{ Mel(f_start) + i · (Mel(f_s/2) − Mel(f_start)) / NF }, i = 1, 2, 3, ..., NF − 1,
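The mel mapping and the channel centre frequencies can be sketched as follows (a sketch with the paper's parameters f_s = 16 kHz, f_start = 64 Hz, NF = 24, N = 400; Python's `round` is used for "rounding towards the nearest integer", which differs from the text only at exact .5 ties):

```python
import math

def mel(x):
    """Mel(x) = 2595 * log10(1 + x / 700)."""
    return 2595.0 * math.log10(1.0 + x / 700.0)

def inv_mel(m):
    """Inverse mel mapping: x = 700 * (10**(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def channel_center_bins(f_s=16000, f_start=64, NF=24, N=400):
    """Centre frequencies f_c_i of the mel filter-bank channels and their
    FFT bin indices cbin_i = round(f_c_i / f_s * N), i = 1..NF-1."""
    lo, hi = mel(f_start), mel(f_s / 2)
    centers = [inv_mel(lo + i * (hi - lo) / NF) for i in range(1, NF)]
    return [round(fc / f_s * N) for fc in centers]
```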
cbin_i = round{ f_{c_i} / f_s · N }, i = 1, ..., NF − 1,

where round(·) stands for rounding towards the nearest integer, and NF = 24 is the number of channels of the filter bank.

The output of the mel filter is the weighted sum of the FFT magnitude spectrum values (bin_i) in each band. Triangular, half-overlapped windowing is used as follows:

fbank_k = \sum_{i=cbin_{k−1}}^{cbin_k} [(i − cbin_{k−1} + 1) / (cbin_k − cbin_{k−1} + 1)] · bin_i
        + \sum_{i=cbin_k+1}^{cbin_{k+1}} [1 − (i − cbin_k) / (cbin_{k+1} − cbin_k + 1)] · bin_i,   k = 1, 2, ..., NF − 1,

where cbin_0 and cbin_24 denote the FFT bin indices corresponding to the starting frequency and half of the sampling frequency, respectively:

cbin_0 = round{ f_start / f_s · N };    cbin_24 = round{ (f_s/2) / f_s · N } = N / 2.

Non-linear transformation. The output of the mel filtering is subjected to a logarithm function (natural logarithm):

f_i = ln(fbank_i), i = 1, 2, ..., NF − 1.

Cepstral coefficients. 12 cepstral coefficients are calculated from the output of the non-linear transformation block:

C_i = \sum_{j=1}^{NF−1} f_j · cos( (π · i / (NF − 1)) · (j − 0.5) ), i = 1, ..., 12.

We apply cepstral mean subtraction to these 12 cepstral coefficients and enter them into the feature vector in the next step.

Cepstral Mean Subtraction (CMS). A speech signal may be subjected to some channel noise when recorded, also referred to as the channel effect. A problem arises if the channel effect when recording training data for a given person is different from the channel effect in later recordings when the person uses the system: a false distance between the training data and the newly recorded data is introduced by the different channel effects. The channel effect is eliminated by subtracting the mean mel-cepstrum coefficients from the mel-cepstrum coefficients:

mc_j(q) = C_j(q) − (1/M) \sum_{i=1}^{M} C_i(q), q = 1, 2, ..., 12.

Calculating the LPC features.
The LPC coefficients of each frame are found by applying the Levinson-Durbin algorithm, and the following cepstral coefficients are calculated:

c(k) = −a_p(k) − \sum_{i=1}^{k−1} (1 − i/k) · a_p(i) · c(k − i), k = 1, ..., 12.

We apply cepstral mean subtraction to these 12 LPC cepstral coefficients and enter them into the feature vector in the next step.

International Science Index 19, 2008 waset.org/publications/7293

V. CONSTRUCTION OF NEURAL NETWORK

There are various mathematical models which form the basis of speech recognition systems. A widely used model is the Multilayer Artificial Neural Network (MANN). Let us briefly describe the structure of the MANN.

Generally, a MANN is an incompletely connected graph. Let L be the number of the MANN's layers; N_l the number of neurons on layer l, l = 1..L; I_j^l the set of neurons of layer (l − 1) that are connected to neuron j on layer l; θ_j^l the bias of neuron j on layer l; w_{ij}^l the weight coefficient (synapse) of the connection between neuron i on layer (l − 1) and neuron j on layer l; and s_{j,p}^l and y_{j,p}^l the state and output value of neuron j on layer l for an input signal x_p ∈ X of the MANN.
Forward propagation of the MANN for an input signal x_p ∈ X is described by the following expressions (Figures 1, 2):

s_{j,p}^l = \sum_{i ∈ I_j^l} w_{ij}^l · y_{i,p}^{l−1} + θ_j^l,    (7)

y_{j,p}^l = f(s_{j,p}^l), j = 1, ..., N_l, l = 1, ..., L,    (8)

y_{j,p}^0 = x_{j,p}, j = 1, ..., N_0,    (9)
where f(·) is a given nonlinear activation function. The logistic or hyperbolic tangent functions can be used as activation functions:

f_log(z) = 1 / (1 + e^{−αz}),    f_tan(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

Their derivatives can be calculated from the function values:

df_log(z)/dz = α · f_log(z) · (1 − f_log(z)),    df_tan(z)/dz = 1 − f_tan^2(z).
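The forward pass (7)–(9) can be sketched in NumPy. This is a simplified sketch: it assumes fully connected layers (the paper allows an incompletely connected graph, i.e. arbitrary index sets I_j^l) and uses the logistic activation:

```python
import numpy as np

def f_log(z, alpha=1.0):
    """Logistic activation f_log(z) = 1 / (1 + exp(-alpha * z))."""
    return 1.0 / (1.0 + np.exp(-alpha * z))

def forward(x, weights, biases, alpha=1.0):
    """Forward propagation (7)-(9) for a fully connected MANN:
    s^l = W^l y^{l-1} + theta^l,  y^l = f(s^l),  y^0 = x."""
    y = x
    for W, theta in zip(weights, biases):
        y = f_log(W @ y + theta, alpha)
    return y
```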
Let a training set of pairs {x_p, d_p}, p = 1..P, be given, where d_p = (d_{1,p}, ..., d_{N_L,p}) is the desired output for the input signal x_p. The training of the MANN consists in finding such w_{ij}^l and θ_j^l, i ∈ I_j^l, j = 1, ..., N_l, l = 1, ..., L, that for the input signal x_p the MANN produces an output y_p maximally close to the desired output d_p. Usually, training quality is defined by the mean square error function:

E(w, θ; x, s, y) = (1/P) \sum_{p=1}^{P} η_p · E_p(w, θ; x_p, s_p, y_p),    (10)

E_p(w, θ; x_p, s_p, y_p) = (1/2) \sum_{j=1}^{N_L} ( y_{j,p}^L − d_{j,p} )^2,
where η_p is a coefficient determining the belonging ("quality") of the input x_p with respect to its "ideal" pattern, p = 1, ..., P, j = 1, ..., N_L.
The task of MANN training constitutes the minimization of criterion (10) with respect to the parameters (w, θ) under conditions (7)–(9). The MANN of the developed system was trained by the conjugate gradient method.
Fig. 1 MANN with two layers

Fig. 2 One neuron description

VI. THE RECOGNITION PROCESS

The speech recognition system consists of two subsystems, based on MFCC and LPC features. These subsystems are trained by neural networks with MFCC and LPC features, respectively. The recognition process is realized in two stages:
1. Recognition is carried out in parallel in the MFCC- and LPC-based subsystems.
2. The recognition results of the MFCC- and LPC-based subsystems are compared, and the speech recognition system confirms only the result that is confirmed by both subsystems.

Since the MFCC and LPC methods are applied to the overlapping frames of the speech signal, the dimension of the feature vector depends on the number of frames. At the same time, the number of frames depends on the length of the speech signal, the sampling frequency, the frame step, and the frame length. In our system the sampling frequency is 16 kHz, the frame step is 160 samples, and the frame length is 400 samples.

Another problem of speech recognition is that the same speech has different time durations: even when the same person repeats the same speech, it has different durations. To partially remove this problem, the time durations are led to the same scale. When the dimension of the scale defined for the speech signal increases, the dimension of the feature vector corresponding to the signal also increases.

The dimension of a neural network is taken as the total number of its weights and biases. A large dimension of the feature vector strongly affects the dimension of the neural network. For example, our neural network consists of 2 layers: the number of input parameters is 420, the number of neurons in the first layer is 50, and the number of neurons in the output layer is 10. The dimension of our neural network is 420 × 50 + 50 × 10 + 60 = 21560.

Since the dimension of the neural network is less than the number of trained samples, there exists a set of various weights and biases giving a minimum to the minimization criterion (10), and applying these different weights and biases to the recognition system gives different results.

Here we suggest constructing the following systems by applying neural networks trained from different initial points. Depending on the aim of the user, the speech recognition system presents a recognition system of different quality. With respect to the error-recognition percentage, the recognition systems are conditionally called strong, intermediate, and weak reliability systems.

Strong reliability system. This is a system that confirms a recognition only when it is confirmed by each neural network trained from the different initial points. If some of these networks reject the recognition, the system does not accept any recognition. This system prevents errors in the recognition process and is therefore the most reliable.

Intermediate reliability system. This system uses voting between the neural networks trained from different initial points, and the recognition system confirms the result of the voting. For example, if the number of neural networks trained by changing the initial points is 3, the system accepts a result confirmed by any two of them. Although the confidence of this system is lower than that of the strong reliability system, its recognition percentage is higher.

Weak reliability system. The method suggested here is sequential. The recognition system is trained from different initial points in the training process. The first trained neural network is used initially; then the second neural network is applied to the patterns unrecognized by the first network. Similarly, the third neural network is applied to the patterns unrecognized by the second network, and so on. This approach minimizes the number of unrecognized patterns; however, it has weak reliability in terms of error rate.

Note that the number of neural networks trained from the different initial points depends on the computer processing power. Apparently, when the number of neural networks used in the system increases, the strong reliability system minimizes the error and the recognition reliability of the system increases. In the weak reliability system, despite the increase of the correct-recognition percentage, the error-recognition percentage also relatively increases and the reliability of the system decreases. The number of neural networks does not affect the results of the intermediate reliability system.
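The three decision schemes can be sketched as follows (a minimal sketch; the per-network prediction values and the convention that a network returns None when it rejects a pattern are our assumptions):

```python
from collections import Counter

def strong(predictions):
    """Strong reliability: accept only if all networks agree, else reject."""
    return predictions[0] if len(set(predictions)) == 1 else None

def intermediate(predictions):
    """Intermediate reliability: accept the majority vote (e.g. 2 of 3)."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes > len(predictions) / 2 else None

def weak(networks, pattern):
    """Weak reliability (sequential): try each trained network in turn
    until one recognizes the pattern; each network returns a label or None."""
    for net in networks:
        label = net(pattern)
        if label is not None:
            return label
    return None
```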
VII. EXPERIMENTAL RESULTS

For training and recognition of the differently scaled speech in the neural network, it is necessary to lead the signals to the same scale. The dimension of the scale is taken as 5840 samples, which corresponds to 35 frames. Every frame has 12 features. Our neural network consists of 2 layers: the number of input parameters is 420, the number of neurons in the first layer is 50, and the number of neurons in the output layer is the number of tested words (10). The testing speech consists of Azerbaijani digits.

For the training process, 140-150 patterns of every spoken digit are entered into the system. The neural networks of the developed system were trained by the conjugate gradient method. In the following tables the results of the MFCC- and LPC-based subsystems are shown separately, together with the results of the speech recognition system that uses the MFCC- and LPC-based subsystems in combination.

TABLE II
THE RESULTS OF THE MFCC AND LPC-BASED SPEECH RECOGNITION SUBSYSTEMS TRAINING FROM DIFFERENT INITIAL POINTS

Number of training | Type of features | Testing patterns | Recognized patterns | Error-recognized patterns | Unrecognized patterns
1 | MFCC | 312 | 291 (93.27%) | 7 (2.24%) | 14 (4.49%)
1 | LPC  | 312 | 288 (92.31%) | 5 (1.6%)  | 19 (6.09%)
2 | MFCC | 312 | 288 (92.31%) | 9 (2.88%) | 15 (4.8%)
2 | LPC  | 312 | 292 (93.59%) | 7 (2.24%) | 13 (4.17%)
3 | MFCC | 312 | 289 (92.63%) | 5 (1.6%)  | 18 (5.77%)
3 | LPC  | 312 | 293 (93.91%) | 6 (1.92%) | 13 (4.17%)

TABLE III
THE RESULTS OF THE STRONG RELIABILITY SYSTEM

Type of features | Testing patterns | Recognized patterns | Error-recognized patterns | Unrecognized patterns
MFCC     | 312 | 273 (87.5%)  | 1 (0.32%) | 38 (12.18%)
LPC      | 312 | 286 (91.61%) | 2 (0.64%) | 24 (7.69%)
Combined | 312 | 264 (84.6%)  | 1 (0.32%) | 47 (15.1%)

TABLE IV
THE RESULTS OF THE INTERMEDIATE RELIABILITY SYSTEM

Type of features | Testing patterns | Recognized patterns | Error-recognized patterns | Unrecognized patterns
MFCC     | 312 | 294 (94.23%) | 3 (0.96%) | 15 (4.81%)
LPC      | 312 | 290 (92.95%) | 5 (1.6%)  | 17 (5.45%)
Combined | 312 | 286 (91.67%) | 2 (0.64%) | 24 (7.69%)

TABLE V
THE RESULTS OF THE WEAK RELIABILITY SYSTEM

Type of features | Testing patterns | Recognized patterns | Error-recognized patterns | Unrecognized patterns
MFCC     | 312 | 299 (95.83%) | 10 (3.2%) | 3 (0.96%)
LPC      | 312 | 296 (94.87%) | 10 (3.2%) | 6 (1.92%)
Combined | 312 | 294 (94.23%) | 4 (1.28%) | 14 (4.49%)

REFERENCES

[1] K.R. Aida-zade, S.S. Rustamov. Research of Cepstral Coefficients for Azerbaijan speech recognition system. Transactions of Azerbaijan National Academy of Sciences, "Informatics and Control Problems", Volume XXV, No. 3, Baku, 2005, pp. 89-94.
[2] K.R. Aida-zade, E.E. Mustafayev. On the optimization of neural network parameters at the training stage. Proceedings of the Republican scientific conference "Modern Problems of Informatization, Cybernetics and Information Technologies", Vol. I, Baku, 2003, pp. 118-121.
[3] Mikael Nilsson, Marcus Ejnarsson. "Speech Recognition using Hidden Markov Model". Department of Telecommunications and Speech Processing, Blekinge Institute of Technology, 2002. http://www.hh.se/staff/maej/publications/MSc Thesis - MiMa.pdf
[4] Group 622. "On Speaker Verification". 2004, 198 p. http://www.control.auc.dk/~jhve02/report_inf6.pdf
[5] A.B. Sergienko. Digital Signal Processing. St. Petersburg: Piter, 2002, 608 p.
[6] ETSI ES 201 108 v1.1.2 (2000-04). "Speech Processing, Transmission and Quality aspects (STQ); distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms". 20 p. http://www.3gpp.org/ftp/TSG_SA/TSG_SA/TSGS_13/docs/PDF/SP-010566.pdf
[7] Bengt Mandersson. Chapter 4, "Signal Modeling". Department of Electroscience, Lund University, August 2005. http://www.tde.lth.se/ugradcourses/osb/osb05_f2_a4.pdf
[8] Bengt Mandersson. Chapter 5, "Levinson-Durbin Recursion". Department of Electroscience, Lund University, September 2005. http://www.tde.lth.se/ugradcourses/osb/osb05_f3_a4.pdf
[9] Group 11: Tejaswini Hebalkar, Lee Hotraphinyo, Richard Tseng. "Voice Recognition and Identification System". Digital Communications and Signal Processing Systems Design, June 2000. http://www.ece.cmu.edu/~ee551/Final_Reports/Gr11.551.S00.pdf
[10] Bengt Mandersson. Chapter 4, "Signal Modeling". Department of Electroscience, Lund University, August 2005. http://www.tde.lth.se/ugradcourses/osb/osb05_f2_a4.pdf
[11] D. Himmelblau. Applied Nonlinear Programming. Moscow: Mir, 1975, 534 p.
Kamil Aida-Zade is with the Institute of Cybernetics of the National Academy of Sciences, Baku, Azerbaijan.
Cemal Ardil is with the National Academy of Aviation, Baku, Azerbaijan.
Samir Rustamov is with the Institute of Cybernetics of the National Academy of Sciences, Baku, Azerbaijan.