The document summarizes techniques for text-independent speaker recognition from audio signals. It discusses the principles of automatic speaker recognition, including identification and verification. The key steps are voice recording, feature extraction using MFCC, building a reference model for each speaker, matching input features against the models, and making an identification or verification decision. Feature extraction involves framing the audio, windowing, the FFT, mapping frequencies to the mel scale, and taking the DCT to produce cepstral coefficients.
BERT: Bidirectional Encoder Representations from Transformers.
BERT is a model pretrained by Google for state-of-the-art NLP tasks.
BERT has the ability to take into account the syntactic and semantic meaning of text.
Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound (TELKOMNIKA JOURNAL)
This paper proposes the combined methods of the Wavelet Transform (WT) and Euclidean Distance (ED) to estimate the expected values of the feature vectors of Indonesian syllables. This research aims to find the best properties, in effectiveness and efficiency, for performing feature extraction of each syllable sound, to be applied in speech recognition systems. This proposed approach, which builds on the previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second phase, the signal is transformed into the frequency domain by using the WT. In the third phase, the ED algorithm is used to estimate the expected feature vector. The result is a list of features for each syllable that can be used in further research, and some recommendations on the most effective and efficient WT to use in performing syllable sound recognition.
BERT - Part 1 Learning Notes of Senthil Kumar (Senthil Kumar M)
In this part 1 presentation, I have attempted to provide a '30,000 feet view' of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art language model in NLP, with high-level technical explanations. I have attempted to collate useful information about BERT from various useful sources.
Improving the Efficiency of Spectral Subtraction Method by Combining it with ... (IJORCS)
In the field of speech signal processing, the spectral subtraction method (SSM) has been successfully used to suppress noise that is added acoustically. SSM does reduce the noise to a satisfactory level, but musical noise is a major drawback of this method. To implement the spectral subtraction method, transformation of the speech signal from the time domain to the frequency domain is required. On the other hand, the wavelet transform displays another aspect of the speech signal. In this paper we apply a new approach in which SSM is cascaded with a wavelet thresholding technique (WTT) to improve the quality of the speech signal by removing the problem of musical noise to a great extent. Results of the proposed system have been simulated in MATLAB.
An Optimized Transform for ECG Signal Compression (IDES Editor)
A significant feature of the coming digital era is the exponential increase in digital data obtained from various signals, especially biomedical signals such as the electrocardiogram (ECG), electroencephalogram (EEG), electromyogram (EMG), etc. How to transmit or store these signals efficiently becomes the most important issue. A digital compression technique is often used to solve this problem. This paper proposes a comparative study of transform-based approaches for ECG signal compression. An adaptive threshold is applied to the transformed coefficients. The algorithm is tested on 10 different records from the MIT-BIH arrhythmia database and obtains a percentage root mean difference of around 0.528% to 0.584% for compression ratios of 18.963:1 to 23.011:1 for the DWT. Among the DFT, DCT and DWT techniques, the DWT has proven to be very efficient for ECG signal coding. Further improvement in the CR is possible with efficient entropy coding.
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the pronunciation-to-pronunciation phonetic distance, together with a critical-distance threshold criterion for the classification. The proposed method consists of two steps: a training step to estimate the critical distance parameter using transcribed data, and a second step that uses this critical distance criterion to classify the input utterances into pronunciation variants and OOV words.
The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. A confusion matrix and the precision, recall and accuracy performance metrics are used for the performance evaluation. Experimental results show significant performance improvement over existing classifiers.
SPEAKER VERIFICATION USING ACOUSTIC AND PROSODIC FEATURES (acijjournal)
In this paper we report an experiment carried out on a recently collected speaker recognition database, namely the Arunachali Language Speech Database (ALS-DB), to make a comparative study of the performance of acoustic and prosodic features for the speaker verification task. The speech database consists of speech data recorded from 200 speakers with Arunachali languages of North-East India as their mother tongue. The collected database is evaluated using a Gaussian mixture model-Universal Background Model (GMM-UBM) based speaker verification system. The acoustic feature considered in the present study is Mel-Frequency Cepstral Coefficients (MFCC) along with its derivatives. The performance of the system has been evaluated for the acoustic and prosodic features individually as well as in combination. It has been observed that the acoustic feature, when considered individually, provides better performance than the prosodic features. However, if the prosodic features are combined with the acoustic feature, the system outperforms both systems in which the features are considered individually. There is nearly a 5% improvement in recognition accuracy with respect to the system where acoustic features are considered individually, and nearly a 20% improvement with respect to the system where only prosodic features are considered.
This is my presentation at a journal club. It is based on the article "Auditory inspired machine learning techniques can improve speech intelligibility and quality for hearing-impaired listeners". You can find all the references in the slides at the end. I review very basic techniques in noise reduction, and how these techniques are implemented with deep neural networks.
Voice recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
This document describes how to build a simple, yet complete and representative automatic speaker recognition system. Such a speaker recognition system has potential in many security applications. For example, users have to speak a PIN (Personal Identification Number) in order to gain access to the laboratory door, or users have to speak their credit card number over the telephone line to verify their identity. By checking the voice characteristics of the input utterance, using an automatic speaker recognition system similar to the one that we will describe, the system is able to add an extra level of security.
Realization and design of a pilot assist decision making system based on spee... (csandit)
A system based on speech recognition is proposed for pilot assist decision-making. It is based on a HIL aircraft simulation platform and uses the microcontroller SPCE061A as the central processor to achieve better reliability and higher cost-effectiveness. Technologies of LPCC (linear predictive cepstral coding) and DTW (Dynamic Time Warping) are applied for isolated-word speech recognition to gain a smaller amount of calculation and better real-time performance. Besides, we adopt PWM (Pulse Width Modulation) regulation technology to effectively regulate each control surface by speech, and thus to assist the pilot in making decisions. By trial and error, it is proved that we have a satisfactory accuracy rate of speech recognition and control effect. More importantly, our paper provides a creative idea for intelligent human-computer interaction and applications of speech recognition in the field of aviation control. Our system is also very easy to extend and apply.
Design and implementation of different audio restoration techniques for audio... (eSAT Journals)
Abstract
Audio signals are corrupted by many types of distortions. Major audio distortions are categorized into globalized and localized distortions. Localized distortions include clipping and clicks, where only certain samples are affected, and globalized distortions include broadband noise, where the complete bandwidth is consumed by noise. Audio restoration is a technique for recovering the audio signal from these distortions. In this paper, audio restoration techniques for removing clipping, clicks and broadband noise are put forward. Recent approaches to the audio restoration problem are based on sparse representation algorithms. Clipping distortion is addressed within a sparse representation framework: it is treated as an inverse problem, where the distorted samples are estimated from the surrounding undistorted samples, embedded in a frame-based scheme, and reconstructed by using an overlap-add method in conjunction with the OMP algorithm and a Gabor/DCT dictionary for modelling audio signals. Broadband denoising is done by using spectral subtraction, and click removal is done by using an adaptive filter method as the first step. Performance measures are based on perception, average SNR calculation and defined parameter variations. This paper also targets the software and hardware implementation of the restoration methods using the TMS320C6713 DSK kit with the help of tools, mainly MATLAB and Code Composer Studio.
Key Words: Audio Distortions, OMP algorithm, Gabor/DCT dictionary, TMS320C6713 DSK
Speaker Recognition System using MFCC and Vector Quantization Approach (ijsrd.com)
This paper presents an approach to speaker recognition using frequency spectral information with the Mel frequency to improve speech feature representation in a Vector Quantization codebook based recognition approach. The Mel frequency approach extracts the features of the speech signal to get the training and testing vectors. The VQ codebook approach uses the training vectors to form clusters and recognizes accurately with the help of the LBG algorithm.
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas... (ijceronline)
Speech Recognition Systems (SRS) have been implemented on various processors, including digital signal processors (DSPs) and field programmable gate arrays (FPGAs), and their performance has been reported in the literature. The fundamental purpose of speech is communication, i.e., the transmission of messages. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset or headphone, as desired. The recognition of speech requires feature extraction and classification. Systems that use speech as input require a microcontroller to carry out the desired actions. In this paper, the Cypress Programmable System on Chip (PSoC) has been studied and used for the implementation of an SRS. Of all the available PSoCs, the PSoC5, containing an ARM Cortex-M3 as its CPU, is used. The noise is first removed from the speech signals using LogMMSE filtering. These signals are then sent to the PSoC5, where the speech is recognized and the desired actions are performed.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSCJournals)
An automatic speaker recognition system is used to recognize an unknown speaker among several reference speakers by making use of speaker-specific information from their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique, performing gender recognition before speaker recognition so as to improve the accuracy and timeliness of our baseline speaker recognition system. Based on the experiments conducted on the MIT database, we demonstrate that our proposed system improves accuracy over the baseline system by approximately 2%, while reducing the computational time by more than 30%.
Isolated words recognition using MFCC, LPC and neural network (eSAT Journals)
Abstract: Automatic speech recognition is an important topic of speech processing. This paper presents the use of an Artificial Neural Network (ANN) for isolated word recognition. Pre-processing is done, and voiced speech is detected based on energy and zero crossing rate (ZCR). The approach used in speech recognition is Mel Frequency Cepstral Coefficients (MFCC) and combined features of both MFCC and Linear Predictive Coding (LPC). Back-propagation is used as the classifier. The recognition accuracy increases when the combined features of both LPC and MFCC are used, compared to the MFCC-only approach, with a neural network as the classifier. Keywords: Pre-processing, Mel Frequency Cepstral Coefficient (MFCC), Linear Predictive Coding (LPC), Artificial Neural Network (ANN).
AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND T... (IJCSEA Journal)
Speech is the most natural way of information exchange. It provides an efficient means of man-machine communication using speech interfacing. Speech interfacing involves speech synthesis and speech recognition. Speech recognition allows a computer to identify the words that a person speaks into a microphone or telephone. The two main components normally used in speech recognition are a signal processing component at the front end and a pattern matching component at the back end. In this paper, a setup that uses Mel frequency cepstral coefficients at the front end and artificial neural networks at the back end has been developed to perform experiments for analyzing speech recognition performance. Various experiments have been performed by varying the number of layers and the type of network transfer function, which helps in deciding the network architecture to be used for acoustic modelling at the back end.
Suppression of noise in noisy speech signals is required in many speech enhancement applications, such as signal recording and transmission from one place to another. In this paper a novel single-line noise cancellation system is proposed using a derivative of the normalized least mean square algorithm. The proposed system has two phases. The first phase is the generation of a secondary reference signal from the incoming primary signal itself, at the initial silence period and at pauses between words, which is essential when an adaptive filter is used as a noise canceller. The second phase is noise cancellation using the proposed modified error data normalized step size (EDNSS) algorithm. The performance of the proposed algorithm is compared with the normalized least mean square (NLMS) algorithm and the original EDNSS algorithm using the standard IEEE sentence (SP23) of the Noizeus database, with different types of real-world noise at different signal-to-noise ratio (SNR) levels. The outputs of the proposed, NLMS and EDNSS algorithms are measured with output SNR, excess mean square error (EMSE) and misadjustment (M). The results clearly illustrate that the proposed algorithm gives improved results over the conventional NLMS and EDNSS algorithms. The speed of convergence is also maintained, the same as the conventional NLMS algorithm.
A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model (IDES Editor)
In this paper, we address the speaker-independent recognition of the Chinese number speeches 0-9 based on HMMs. Our former results for inside and outside testing achieved 92.5% and 76.79% respectively. To further improve the performance, two important features of speech, the MFCC and the cluster number of vector quantification, are unified together and evaluated over various values. The best performance achieves 96.2% and 83.1% with MFCC number = 20 and VQ clustering number = 64.
GENDER RECOGNITION SYSTEM USING SPEECH SIGNAL (IJCSEIT Journal)
In this paper, a system developed for speech encoding, analysis, synthesis and gender identification is presented. A typical gender recognition system can be divided into a front-end system and a back-end system. The task of the front-end system is to extract the gender-related information from a speech signal and represent it by a set of vectors called features. Features like power spectral density and the frequency at maximum power carry speaker information. The features are extracted using the Fast Fourier Transform (FFT) algorithm. The task of the back-end system (also called the classifier) is to create a gender model to recognize the gender from his/her speech signal in the recognition phase. This paper also presents the digital processing of speech signals (pronounced "A" and "B") taken from 10 persons, 5 of them male and the rest female. The power spectrum estimate of the signal is examined, and the frequency at maximum power of the English phonemes is extracted from the estimated power spectrum. The system uses a threshold technique as the identification tool. The recognition accuracy of this system is 80% on average.
1. “Development of Some Techniques for Text-Independent Speaker Recognition from Audio Signals”
By Bidhan Barai
Under the guidance of Dr. Nibaran Das and Dr. Subhadip Basu, Assistant Professors of Computer Science & Engineering, Jadavpur University, Kolkata – 700 032
3. Introduction
● Speaker recognition is the identification of a person from characteristics of voices (voice biometrics). It is also called voice recognition. There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said).
● In addition, there is a difference between the act of authentication (commonly referred to as speaker verification or speaker authentication) and identification.
4. Types of Speaker Identification
● Text-Dependent: If the text must be the same for enrollment and verification, this is called text-dependent recognition. In a text-dependent system, prompts can either be common across all speakers (e.g., a common pass phrase) or unique.
● Text-Independent: Text-independent systems are most often used for speaker identification, as they require very little if any cooperation by the speaker. In this case the text during enrollment and test is different.
5. Types of Speaker Identification
● Closed-Set: It is assumed that the speaker is in the database. In closed-set identification, the audio of the test speaker is compared against all the available speaker models and the speaker ID of the model with the closest match is returned. The result is the best-matched speaker.
● Open-Set: The speaker may not be in the database. Open-set identification may be viewed as a combination of closed-set identification and speaker verification. The result can be a speaker or a no-match result.
6. Principles of Automatic Speaker Recognition
● Speaker recognition can be classified into identification and verification.
● Speaker identification is the process of determining which registered speaker provides a given utterance.
● Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.
● The following figures show the basic structures of speaker identification and verification systems. The system that we will describe is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.
7. Principles of Automatic Speaker Recognition ... Contd.
Figure 1: Block diagram of a speaker recognition system.
8. Principles of Automatic Speaker Recognition ... Contd.
● Speaker recognition:
Figure 2: Block diagram of speaker identification. The input speech passes through feature extraction; the features are scored for similarity against the reference model of each speaker (#1, ..., #N), and maximum selection yields the identification result (speaker ID).
9. Principles of Automatic Speaker Recognition ... Contd.
● Speaker verification:
Figure 3: Block diagram of speaker verification. Features extracted from the input speech are scored for similarity against the reference model of the claimed speaker (speaker ID #M); the score is compared against a threshold to give the verification result (accept/reject).
10. Principles of Automatic Speaker Recognition ... Contd.
● All speaker recognition systems operate in two distinct phases. The first is referred to as the enrollment or training phase, while the second is referred to as the operational or testing phase.
● In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples.
● In the testing phase, the input speech is matched with the stored reference model(s) and a recognition decision is made.
12. Step 1: Voice Recording
● The speech input is typically recorded at a sampling rate above 10000 Hz (10 kHz).
● This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans.
● This sampling rate (10 kHz) is determined by the Nyquist sampling theorem.
13. Step 2: Speech Feature Extraction
● The purpose of this module is to convert the speech waveform, using digital signal processing (DSP) tools, to a set of features (at a considerably lower information rate) for further analysis. This is often referred to as the signal-processing front end.
● The speech signal is a slowly time-varying signal (it is called quasi-stationary). When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken.
● Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
14. Speech Feature Extraction...Contd
Examples of speech signals: A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Group Delay Features (GDF) and others. MFCC is perhaps the best known and most popular, and will be described in this project.
Figure 4 and Figure 5: example speech signals.
15. Speech Feature Extraction...Contd
● Mel-Frequency Cepstrum Coefficients processor: A block diagram of the structure of an MFCC processor is given in Figure 6.
16. Speech Feature Extraction...Contd
● Steps of extracting Feature from Speech Signal:
1> Pre-emphasis
2> Frame Blocking
3> Windowing
4> Fast Fourier Transform (FFT)
5> Mel-frequency Wrapping
6> Cepstrum: Logarithmic Compression and Discrete Cosine Transform (DCT)
17. Speech Feature Extraction...Contd
● Pre-emphasis: In speech processing, the original signal usually has too much low-frequency energy, and processing the signal to emphasize the high-frequency energy is necessary. To perform pre-emphasis, we choose some value α between 0.9 and 1. Then each value in the signal is re-evaluated using the formula
y[n] = x[n] − α x[n−1], where 0.9 < α < 1.
This is a first-order high-pass filter.
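As an illustration (not part of the original slides), a minimal NumPy sketch of this first-order high-pass filter could look as follows; alpha = 0.97 is an assumed, commonly used value within the 0.9-1 range above:

import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged
    return np.append(x[0], x[1:] - alpha * x[:-1])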
19. Speech Feature Extraction...Contd
● Frame Blocking: The input speech signal is segmented into frames of 20~30 ms with an optional overlap of 1/3~1/2 of the frame size. Usually the frame size (in terms of sample points) is equal to a power of two in order to facilitate the use of the FFT. If this is not the case, we need to zero-pad to the nearest power-of-two length.
● Windowing: Each frame has to be multiplied by a Hamming window in order to keep the continuity of the first and the last points in the frame. If the signal in a frame is denoted by s(n), n = 0, ..., N−1, then the signal after Hamming windowing is s(n)·w(n), where w(n) is the Hamming window defined by
w(n, α) = (1 − α) − α cos(2πn/(N−1)), 0 ≤ n ≤ N−1
Different values of α correspond to different curves for the Hamming window, shown next.
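A small sketch of frame blocking plus Hamming windowing; the frame length of 256 samples (about 25.6 ms at the 10 kHz rate above, and a power of two as suggested) and the 50% overlap are illustrative assumptions:

import numpy as np

def frame_and_window(x, frame_len=256, hop=128):
    # Split the signal into overlapping frames
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # Hamming window with alpha = 0.46: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * w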
22. Speech Feature Extraction...Contd
● Fast Fourier Transform (FFT): The Discrete Fourier Transform (DFT) of a discrete-time signal x(nT) is given by
X(k) = ∑_{n=0}^{N−1} x[n] e^{−j(2π/N)nk}, k = 0, 1, ..., N−1
where x(nT) = x[n].
23. Speech Feature Extraction...Contd
● If we let e^{−j(2π/N)} = W_N, then
X(k) = ∑_{n=0}^{N−1} x[n] W_N^{nk}
Figure 10: a sampled signal (amplitude vs. sample index) and its frequency-domain magnitude (vs. normalised frequency).
24. Speech Feature Extraction...Contd
● x[n] = x[0], x[1], ..., x[N−1]
X(k) = ∑_{n=0}^{N−1} x[n] W_N^{nk}, 0 ≤ k ≤ N−1   [1]
Let us divide the sequence x[n] into even and odd sequences:
x[2n] = x[0], x[2], ..., x[N−2]
x[2n+1] = x[1], x[3], ..., x[N−1]
25. Speech Feature Extraction...Contd
● Equation 1 can be rewritten as:
X(k) = ∑_{n=0}^{N/2−1} x[2n] W_N^{2nk} + ∑_{n=0}^{N/2−1} x[2n+1] W_N^{(2n+1)k}   [2]
Since:
W_N^{2nk} = e^{−j(2π/N)2nk} = e^{−j(2π/(N/2))nk} = W_{N/2}^{nk}  and  W_N^{(2n+1)k} = W_N^{k} · W_{N/2}^{nk}
Then:
X(k) = ∑_{n=0}^{N/2−1} x[2n] W_{N/2}^{nk} + W_N^{k} ∑_{n=0}^{N/2−1} x[2n+1] W_{N/2}^{nk} = Y(k) + W_N^{k} Z(k)
26. Speech Feature Extraction...Contd
● The result is that an N-point DFT can be divided into two N/2-point DFTs:
X(k) = ∑_{n=0}^{N−1} x[n] W_N^{nk}, 0 ≤ k ≤ N−1   (N-point DFT)
● where Y(k) and Z(k) are the two N/2-point DFTs operating on the even and odd samples respectively:
X(k) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{nk} + W_N^{k} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{nk} = Y(k) + W_N^{k} Z(k)   (two N/2-point DFTs)
27. Speech Feature Extraction...Contd
● The periodicity and symmetry of W can be exploited to simplify the DFT further:
X(k) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{nk} + W_N^{k} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{nk}
X(k + N/2) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{n(k+N/2)} + W_N^{k+N/2} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{n(k+N/2)}   [3]
Symmetry: W_N^{k+N/2} = e^{−j(2π/N)k} e^{−j(2π/N)(N/2)} = e^{−j(2π/N)k} e^{−jπ} = −e^{−j(2π/N)k} = −W_N^{k}
Periodicity: W_{N/2}^{k+N/2} = e^{−j(2π/(N/2))k} e^{−j(2π/(N/2))(N/2)} = e^{−j(2π/(N/2))k} = W_{N/2}^{k}
28. Speech Feature Extraction...Contd
● Finally, by exploiting the symmetry and periodicity, Equation 3 can be written as:
X(k + N/2) = ∑_{n=0}^{N/2−1} x1[n] W_{N/2}^{nk} − W_N^{k} ∑_{n=0}^{N/2−1} x2[n] W_{N/2}^{nk} = Y(k) − W_N^{k} Z(k)   [4]
● Hence the complete equations for finding the FFT are:
X(k) = Y(k) + W_N^{k} Z(k), k = 0, ..., N/2 − 1
X(k + N/2) = Y(k) − W_N^{k} Z(k), k = 0, ..., N/2 − 1
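These two butterfly equations translate directly into a recursive radix-2 FFT. The sketch below is an illustration, not the slides' own code; it requires the input length to be a power of two (which is why frames are zero-padded to a power-of-two length above) and can be checked against numpy.fft.fft:

import numpy as np

def fft_radix2(x):
    # Implements X(k) = Y(k) + W_N^k Z(k) and X(k + N/2) = Y(k) - W_N^k Z(k)
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    Y = fft_radix2(x[0::2])                          # N/2-point DFT of even samples
    Z = fft_radix2(x[1::2])                          # N/2-point DFT of odd samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # twiddle factors W_N^k
    return np.concatenate([Y + W * Z, Y - W * Z])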
30. Speech Feature Extraction...Contd
● Mel-frequency Wrapping: Psychophysical studies have shown that human perception of the frequency content of sounds does not follow a linear scale. That research has led to the concept of subjective frequency, i.e., the perceived frequency of a sound, defined as follows. For each sound with an actual frequency f, measured in Hz, a subjective frequency is measured on a scale called the "mel scale". The mel frequency can be approximated by
Mel(f) = 2595 log10(1 + f/700)
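The mapping and its inverse are one-liners; this small sketch (function names are mine, for illustration) is used again inline when the filter bank is built below. As a check, hz_to_mel(1000) evaluates to roughly 1000, consistent with the scale being roughly linear up to 1000 Hz:

import numpy as np

def hz_to_mel(f):
    # Mel(f) = 2595 * log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    # Inverse of the mapping above
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)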
32. Speech Feature Extraction...Contd
● In the mel-frequency scale, there is linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
● Triangular Filter Bank: The human ear acts essentially like a bank of overlapping band-pass filters, and human perception is based on the mel scale. Thus, the approach to simulating human perception is to build a filter bank with bandwidths given by the mel scale, pass the magnitudes of the spectra through these filters, and obtain the mel-frequency spectrum.
33. Speech Feature Extraction...Contd
● Equally spaced mel values.
● We define a triangular filter bank with M filters (m = 1, 2, ..., M), where H_m[k] is the magnitude (frequency response) of the m-th filter, given by:
H_m(k) = 0                                  for k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1))     for f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m))     for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0                                  for k > f(m+1)
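A sketch of one common way to realize this filter bank, mapping the edge frequencies f(m−1), f(m), f(m+1), equally spaced on the mel scale, onto FFT bins; the parameter values (26 filters, 512-point FFT, 10 kHz sampling) are illustrative assumptions, not values from the slides:

import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=10000):
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)    # Mel(f)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # inverse
    # n_filters + 2 edge points, equally spaced on the mel scale up to Nyquist
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of triangle m
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of triangle m
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H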
35. Speech Feature Extraction...Contd
● Given the FFT of the input signal x[n]:
X[k] = ∑_{n=0}^{N−1} x[n] e^{−j2πnk/N}, 0 ≤ k ≤ N
● The values of the FFT are weighted by the triangular filters. The result is called the mel-frequency power spectrum, which is defined as:
S[m] = ∑_{k=1}^{N} |X_a[k]|² H_m[k], 0 < m ≤ M
where |X_a[k]|² is called the power spectrum.
36. Speech Feature Extraction...Contd
● Schematic diagram of the filter-bank energies.
● Finally, a discrete cosine transform (DCT) of the logarithm of S[m] is computed to form the MFCCs as:
mfcc[i] = ∑_{m=1}^{M} log(S[m]) cos[i (m − 1/2) π/M], i = 1, 2, ..., L
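Putting the pieces together for a single windowed frame: power spectrum, triangular filtering, log, then the DCT formula above. This is a sketch under the same illustrative assumptions; H is a filter bank such as the one from the previous sketch, and L = 13 coefficients is an assumed common choice:

import numpy as np

def mfcc_frame(frame, H, n_coeffs=13):
    spectrum = np.fft.rfft(frame, n=(H.shape[1] - 1) * 2)
    power = np.abs(spectrum) ** 2                  # |X_a[k]|^2, the power spectrum
    S = H @ power                                  # S[m] = sum_k |X_a[k]|^2 H_m[k]
    M = len(S)
    i = np.arange(1, n_coeffs + 1)[:, None]        # i = 1, ..., L
    m = np.arange(1, M + 1)[None, :]
    dct_basis = np.cos(i * (m - 0.5) * np.pi / M)  # cos[i (m - 1/2) pi / M]
    return dct_basis @ np.log(S + 1e-10)           # small epsilon guards log(0)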
38. GMM
● A mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution.
● A Gaussian is the characteristic symmetric "bell curve".
39. GMM...Contd
● Mathematical description of a GMM:
p(x) = ∑_{i=1}^{n} w_i p_i(x)
where p(x) is the mixed density function, w_i is the mixture weight (mixture coefficient), and p_i(x) is the component density function.
42. GMM...Contd
● Hence the component density function is:
p_i(x) = N(x | μ_i, Σ_i)
● The description of the GMM becomes
p(x) = ∑_{i=1}^{n} w_i N(x | μ_i, Σ_i)
where the μ_i are the means and the Σ_i are the covariance matrices of the individual components (probability density functions).
(Figure: five Gaussian components G1, ..., G5 with mixture weights w1, ..., w5.)
43. GMM...Contd
● The Gaussian (normal) density function, in which each of the mixture components is a Gaussian distribution with its own mean and variance parameters, is the most common mixture distribution. The feature vectors follow the Gaussian distribution; hence X is distributed normally:
X ∼ N(x | μ, Σ)   (multivariate normal distribution)
where μ is the mean and Σ is the covariance matrix.
44. GMM...Contd
● The GMM for a speaker is denoted by
λ = {w_i, μ_i, Σ_i}, where i = 1, 2, ..., M
Here a speaker is represented by a mixture of M Gaussian components.
● The Gaussian mixture density is
p(x⃗ | λ) = ∑_{i=1}^{M} w_i p_i(x⃗)
where x⃗ is a D-dimensional random vector (variable).
45. GMM...Contd
● The component density is given by
p_i(x⃗) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp{ −(1/2) (x⃗ − μ_i)^T Σ_i^{−1} (x⃗ − μ_i) }
● The schematic diagram of the GMM of a speaker is given below.
(Figure: the component densities p_1(·), ..., p_M(·), with parameters (μ_1, Σ_1), ..., (μ_M, Σ_M), are weighted by w_1, ..., w_M and summed to produce p(x⃗ | λ).)
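A direct sketch of evaluating log p(x⃗ | λ) for a diagonal-covariance GMM; the parameter arrays (weights of shape (M,), means and variances of shape (M, D)) are hypothetical placeholders, not values from the slides:

import numpy as np

def gmm_log_density(x, weights, means, variances):
    # log of each component N(x | mu_i, Sigma_i), with diagonal Sigma_i
    D = x.shape[-1]
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # log-sum-exp over the M components, for numerical stability
    c = np.max(log_comp)
    return c + np.log(np.sum(np.exp(log_comp - c)))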
46. Model Parameter Estimation
● To create a GMM we are required to find the numerical values of the model parameters w_i, μ_i and Σ_i.
● To obtain an optimum model representing each speaker we need to calculate a good estimate of the GMM parameters. A very efficient method for doing so is the maximum-likelihood estimation (MLE) approach. For speaker identification, each speaker is represented by a GMM and is referred to by his/her model. In this regard the EM algorithm is a very useful tool for finding the optimum model parameters by the MLE approach.
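In practice the EM fitting can be delegated to a library. Below is an enrollment-phase sketch using scikit-learn's GaussianMixture, whose fit() runs EM for maximum-likelihood estimation; speaker_features (a mapping from speaker ID to an (n_frames, D) array of feature vectors) and M = 32 components are assumptions for illustration:

from sklearn.mixture import GaussianMixture

def train_speaker_models(speaker_features, n_components=32):
    # One GMM (lambda = {w_i, mu_i, Sigma_i}) per registered speaker
    models = {}
    for speaker_id, X in speaker_features.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[speaker_id] = gmm.fit(X)   # EM runs inside fit()
    return models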
47. Step 4: Pattern Matching: Classification
● In this stage, a series of input vectors is compared, and a decision is made as to which of the speakers in the set is the most likely to have spoken the test data. The input to the classification system is denoted as
x⃗ = {x_1, x_2, x_3, ..., x_T}
● Using the models of each speaker and the unknown vectors, the fitness values are calculated with the help of the posterior probability. We classify the vectors to the speaker whose model gives the maximum fitness value.
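Under the frame-independence assumption above, the total log-likelihood of the test vectors x_1, ..., x_T is the sum of the per-frame log-likelihoods, and the identified speaker is the arg-max over the speaker models. A sketch continuing the hypothetical train_speaker_models() above:

def identify_speaker(models, X_test):
    # Sum per-frame log-likelihoods under each speaker's GMM, pick the maximum
    scores = {speaker_id: gmm.score_samples(X_test).sum()
              for speaker_id, gmm in models.items()}
    return max(scores, key=scores.get)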
48. Conclusion...Contd
● Modification can be done in the following
cases:
1> Feature Extraction
2> MFCC Feature
3> Filter Bank
4> Modeling Techniques
5> Pattern Matching
49. Conclusion...Contd
● Feature Extraction: In the MFCC feature the phase information is not taken into account; only the magnitude is considered. So, by using phase information along with the MFCC feature, new feature vectors can be derived.
● Pattern Matching: In the pattern matching step it is assumed that the feature vectors of the unknown speaker are independent. With this assumption the posterior probability is calculated. But we can use some orthogonal transformation to transform the set of vectors into a new set of orthogonal vectors. Hence, after the transformation the vectors become independent, and then we can proceed as before.
50. References
● [1] Molau, S., Pitz, M., Schlüter, R. & Ney, H. (2001), Computing Mel-Frequency Cepstral Coefficients on the Power Spectrum, IEEE International Conference on Acoustics, Speech and Signal Processing, Germany, 2001: 73-76.
● [2] Huang, X., Acero, A. & Hon, H. (2001), Spoken Language Processing - A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey.
● [3] Homayoon Beigi (2011), Fundamentals of Speaker Recognition, Springer.
● [4] Daniel J. Mashao, Marshalleno Skosan, Combining classifier decisions for robust speaker identification, Elsevier, 2006.
● [5] W.M. Campbell, J.P. Campbell, D.A. Reynolds, E. Singer, P.A. Torres-Carrasquillo, Support vector machines for speaker and language recognition, Elsevier, 2006.
● [6] Seiichi Nakagawa, Kouhei Asakawa, Longbiao Wang, Speaker Recognition by Combining MFCC and Phase Information, INTERSPEECH 2007.
● [7] Nilsson, M. & Ejnarsson, M., Speech Recognition Using Hidden Markov Model: Performance Evaluation in Noisy Environment, Blekinge Institute of Technology, Sweden, 2002.
51. References...Contd
● [8] Stevens, S. S. & Volkman, J. (1940), The Relation of the Pitch to Frequency, Journal of Psychology, 1940(53): 329.
● [9] A. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Trans. Circuits Systems Video Technol., vol. 14, no. 1, pp. 4-20, 2004.
● [10] D. Reynolds, "An overview of automatic speaker recognition technology," in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing (ICASSP), 2002, vol. 4, pp. 4072-4075.
● [11] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoustics Speech Signal Process., vol. 29, no. 2, pp. 254-272, 1981.
● [12] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72-83, 1995.
● [13] D. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Commun., vol. 17, no. 1-2, pp. 91-108, 1995.
52. References...Contd
● [14] Man-Wai Mak, Wei Rao, Utterance partitioning with acoustic vector resampling for GMM-SVM speaker verification, Elsevier, 2011.
● [15] Md. Sahidullah, Goutam Saha, Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition, Elsevier, 2011.
● [16] Qi Li and Yan Huang, An Auditory-Based Feature Extraction Algorithm for Robust Speaker Identification Under Mismatched Conditions, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, August 2011.
● [17] Alfredo Maesa, Fabio Garzia, Michele Scarpiniti, Roberto Cusani, Text Independent Automatic Speaker Recognition System Using Mel-Frequency Cepstrum Coefficient and Gaussian Mixture Models, Journal of Information Security, 2012.
● [18] Ming Li, Kyu J. Han, Shrikanth Narayanan, Automatic speaker age and gender recognition using acoustic and prosodic level information fusion, Elsevier, 2013.