This document summarizes a paper that presents a speaker identification system using Mel Frequency Cepstral Coefficients (MFCCs). MFCCs are used to extract features from speech signals that are less susceptible to variations between recordings of the same speaker. Vector quantization is then used to compress the extracted features for matching against enrolled speaker models. The system contains modules for feature extraction using MFCCs and feature matching, which are the two main components of all speaker recognition systems.
3rd International Conference on Electrical & Computer Engineering
ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh

SPEAKER IDENTIFICATION USING MEL FREQUENCY CEPSTRAL COEFFICIENTS

Md. Rashidul Hasan, Mustafa Jamil, Md. Golam Rabbani, Md. Saifur Rahman
Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000
E-mail: saif672@yahoo.com
ABSTRACT

This paper presents a security system based on speaker identification. Mel frequency cepstral coefficients (MFCCs) have been used for feature extraction, and a vector quantization technique is used to minimize the amount of data to be handled.

1. INTRODUCTION

Speech is one of the natural forms of communication. Recent developments have made it possible to use this in security systems. In speaker identification, the task is to use a speech sample to select the identity of the person that produced the speech from among a population of speakers. In speaker verification, the task is to use a speech sample to test whether a person who claims to have produced the speech has in fact done so [1]. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

2. PRINCIPLES OF SPEAKER RECOGNITION

Speaker recognition methods can be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying [1]. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. Every technology of speaker recognition, whether identification or verification, and whether text-independent or text-dependent, has its own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching [2,3].

3. SPEECH FEATURE EXTRACTION

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate). The speech signal is a slowly time-varying signal (it is called quasi-stationary). When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over long periods of time (on the order of 0.2 s or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and this feature has been used in this paper. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. The MFCC technique makes use of two types of filter, namely, linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed in the Mel frequency scale. This scale has a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. Normal speech waveforms may vary from time to time depending on the physical condition of the speakers' vocal cords. MFCCs are less susceptible to these variations than the speech waveforms themselves [1,4].

ISBN 984-32-1804-4
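The mel scale described above (linear below 1000 Hz, logarithmic above) can be illustrated with a short Python sketch. This is not code from the paper; it simply implements the standard warping formula mel(f) = 2595·log10(1 + f/700) given as Eq. (1), together with its inverse, which is useful when placing the centers of the triangular filters:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (Eq. 1 in the paper)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse mapping, useful for placing triangular filter centers."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is roughly linear below 1000 Hz and compressive above it:
for f in (500, 1000, 2000, 4000):
    print(f"{f} Hz -> {hz_to_mel(f):.1f} mel")
```

Note that the formula reproduces the reference point mentioned in Section 3.2: a 1 kHz tone maps to approximately 1000 mels.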
3.1 The MFCC processor

A block diagram of the structure of an MFCC processor is given in Figure 1. The speech input is recorded at a sampling rate of 22050 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion process.

Figure 1 Block diagram of the MFCC processor (Continuous Speech → Frame Blocking → Windowing → FFT → Mel-frequency Wrapping (spectrum S_k) → Cepstrum → mel cepstrum)

3.2 Mel-frequency wrapping

The speech signal consists of tones with different frequencies. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on the 'Mel' scale. The mel-frequency scale has a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following formula to compute the mels for a given frequency f in Hz [5]:

mel(f) = 2595*log10(1 + f/700) ..........(1)

One approach to simulating the subjective spectrum is to use a filter bank, one filter for each desired mel-frequency component. The filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval.

3.3 Cepstrum

In the final step, the log mel spectrum has to be converted back to time. The result is called the mel frequency cepstral coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithms) are real numbers, they may be converted to the time domain using the Discrete Cosine Transform (DCT). The MFCCs may be calculated using this equation [3,5]:

c̃_n = Σ_{k=1}^{K} (log S̃_k) cos[n(k − 1/2)π/K] ..........(2)

where n = 1, 2, ..., K. The number of mel cepstrum coefficients, K, is typically chosen as 20. The first component, c̃_0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information. By applying the procedure described above, a set of mel-frequency cepstrum coefficients is computed for each speech frame of about 30 ms with overlap. This set of coefficients is called an acoustic vector. These acoustic vectors can be used to represent and recognize the voice characteristic of the speaker [4]. Therefore each input utterance is transformed into a sequence of acoustic vectors. The next section describes how these acoustic vectors can be used to represent and recognize the voice characteristic of a speaker.

4. FEATURE MATCHING

The state-of-the-art feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). The VQ approach has been used here for its ease of implementation and high accuracy.

4.1 Vector quantization

Vector quantization (VQ) is a lossy data compression method based on the principle of block coding [6]. It is a fixed-to-fixed length algorithm. VQ may be thought of as an approximator. Figure 2 shows an example of a 2-dimensional VQ.

Figure 2 An example of a 2-dimensional VQ

Here, every pair of numbers falling in a particular region is approximated by a star associated with that region. In Figure 2, the stars are called codevectors and the regions defined by the borders are called encoding regions. The set of all codevectors is called the codebook, and the set of all encoding regions is called the partition of the space [6].
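The codebook and nearest-codevector matching described above can be sketched as follows. This is a minimal illustration with made-up 2-D data and Euclidean distance; in the actual system the vectors would be MFCC acoustic vectors rather than these toy points.

```python
import math

def nearest_codevector(x, codebook):
    """Return (index, distance) of the codevector closest to x."""
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = math.dist(x, c)  # Euclidean distance (Python 3.8+)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

def total_vq_distortion(vectors, codebook):
    """Sum of distances from each vector to its nearest codevector."""
    return sum(nearest_codevector(v, codebook)[1] for v in vectors)

# Illustrative 2-D example: two codevectors (the "stars" of Figure 2)
codebook = [(0.0, 0.0), (4.0, 4.0)]
points = [(0.5, 0.2), (3.8, 4.1), (4.2, 3.9)]
print(nearest_codevector((0.5, 0.2), codebook)[0])  # -> 0
```

In the recognition phase described later, `total_vq_distortion` is the quantity computed per speaker codebook, and the speaker with the smallest total wins.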
4.2 LBG design algorithm

The LBG VQ design algorithm is an iterative algorithm (as proposed by Y. Linde, A. Buzo & R. Gray) which alternately solves the optimality criteria [7]. The algorithm requires an initial codebook, which is obtained by the splitting method. In this method, an initial codevector is set as the average of the entire training sequence. This codevector is then split into two. The iterative algorithm is run with these two vectors as the initial codebook. The final two codevectors are split into four, and the process is repeated until the desired number of codevectors is obtained. The algorithm is summarized in the flowchart of Figure 3.

Figure 3 Flowchart of the VQ-LBG algorithm (find centroid → split each centroid, m = 2*m → cluster vectors → find centroids → compute distortion D → iterate until (D' − D)/D < ε, setting D' = D each pass → if m < M, split again; otherwise stop)

In Figure 4, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2.

Figure 4 Conceptual diagram to illustrate the VQ process

In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resultant codewords (centroids) are shown in Figure 4 by circles and triangles at the centers of the corresponding blocks for speaker 1 and speaker 2, respectively. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified. Figure 5 shows the use of different numbers of centroids for the same data field.

Figure 5 Pictorial view of codebooks with 5 and 15 centroids respectively
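The splitting-and-refinement loop of the LBG algorithm described above can be sketched as follows. This is an illustrative pure-Python version assuming Euclidean distance and a simple multiplicative split; the parameter values (eps, delta) are assumptions for the sketch, not values from the paper.

```python
import math

def lbg_codebook(training, m_target, eps=0.01, delta=0.01):
    """Design a VQ codebook of size m_target by LBG splitting.

    Starts from the centroid of the whole training sequence, doubles
    the codebook by splitting every codevector (m -> 2*m), and after
    each split refines the codebook with clustering iterations until
    the relative drop in average distortion falls below eps.
    """
    dim = len(training[0])

    def centroid(vectors):
        return tuple(sum(v[k] for v in vectors) / len(vectors)
                     for k in range(dim))

    def refine(codebook):
        d_prev = None
        while True:
            # Cluster: assign each training vector to its nearest codevector
            clusters = [[] for _ in codebook]
            d_total = 0.0
            for v in training:
                i = min(range(len(codebook)),
                        key=lambda j: math.dist(v, codebook[j]))
                clusters[i].append(v)
                d_total += math.dist(v, codebook[i])
            d_avg = d_total / len(training)
            # Update: move each codevector to the centroid of its cluster
            codebook = [centroid(c) if c else codebook[i]
                        for i, c in enumerate(clusters)]
            # Stop when the distortion drop (D' - D)/D has converged
            if d_avg == 0 or (d_prev is not None
                              and (d_prev - d_avg) / d_avg < eps):
                return codebook
            d_prev = d_avg

    codebook = refine([centroid(training)])        # m = 1
    while len(codebook) < m_target:                # split: m -> 2*m
        codebook = [tuple(x * (1 + s) for x in c)
                    for c in codebook for s in (+delta, -delta)]
        codebook = refine(codebook)
    return codebook
```

Run once per enrolled speaker over that speaker's training acoustic vectors; the returned codebook is the speaker model used at recognition time.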
5. RESULTS

The system has been implemented in Matlab 6.1 on the Windows XP platform. The results of the study are presented in Table 1 and Table 2. The speech database consists of 21 speakers, which includes 13 male and 8 female speakers. Here, identification rate is defined as the ratio of the number of speakers identified to the total number of speakers tested.

Table 1: Identification rate (in %) for different windows [using Linear scale]

Codebook size | Triangular | Rectangular | Hamming
1             | 66.67      | 38.95       | 57.14
2             | 85.7       | 42.85       | 85.7
4             | 90.47      | 57.14       | 90.47
8             | 95.24      | 57.14       | 95.24
16            | 100        | 80.95       | 100
32            | 100        | 80.95       | 100
64            | 100        | 85.7        | 100

Table 2: Identification rate (in %) for different windows [using Mel scale]

Codebook size | Triangular | Rectangular | Hamming
1             | 57.14      | 57.14       | 57.14
2             | 85.7       | 66.67       | 85.7
4             | 90.47      | 76.19       | 100
8             | 95.24      | 80.95       | 100
16            | 100        | 85.7        | 100
32            | 100        | 90.47       | 100
64            | 100        | 95.24       | 100

Table 1 shows the identification rate when a triangular, rectangular, or Hamming window is used for framing with a linear frequency scale. The table clearly shows that as codebook size increases, the identification rate for each of the three cases increases; when the codebook size is 16, the identification rate is 100% for both the triangular and Hamming windows. In Table 2, the same windows are used along with a Mel scale instead of a linear scale. Here, too, the identification rate increases with the size of the codebook. In this case, a 100% identification rate is obtained with a codebook size of 4 when the Hamming window is used.

6. CONCLUSION

The MFCC technique has been applied for speaker identification. VQ is used to minimize the data of the extracted features. The study reveals that as the number of centroids increases, the identification rate of the system increases. It has been found that the combination of the Mel frequency scale and the Hamming window gives the best performance. It also suggests that in order to obtain a satisfactory result, the number of centroids has to be increased as the number of speakers increases. The study shows that the linear scale can also achieve a reasonable identification rate if a comparatively higher number of centroids is used. However, the recognition rate using a linear scale would be much lower if the number of speakers increased. The Mel scale is also less vulnerable to changes in the speaker's vocal cord over time.

The present study is still ongoing and may include the following further work. HMM may be used to improve the efficiency and precision of segmentation to deal with crosstalk, laughter, and uncharacteristic speech sounds. A more effective normalization algorithm could be adopted for the extracted parametric representations of the acoustic signal, which would improve the identification rate further. Finally, a combination of features (MFCC, LPC, LPCC, formants, etc.) may be used to implement a robust parametric representation for speaker identification.

REFERENCES

[1] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] Zhong-Xuan Yuan, Bo-Ling Xu and Chong-Zhi Yu, "Binary Quantization of Feature Vectors for Robust Text-Independent Speaker Identification", IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, January 1999. IEEE, New York, NY, U.S.A.
[3] F. Soong, E. Rosenberg, B. Juang, and L. Rabiner, "A Vector Quantization Approach to Speaker Recognition", AT&T Technical Journal, Vol. 66, March/April 1987, pp. 14-26.
[4] Comp.speech Frequently Asked Questions WWW site, http://svr-www.eng.cam.ac.uk/comp.speech/
[5] J. R. Deller Jr., J. Hansen, and J. Proakis, Discrete-Time Processing of Speech Signals, second ed., IEEE Press, New York, 2000.
[6] R. M. Gray, "Vector Quantization", IEEE ASSP Magazine, pp. 4-29, April 1984.
[7] Y. Linde, A. Buzo and R. Gray, "An algorithm for vector quantizer design", IEEE Transactions on Communications, Vol. 28, pp. 84-95, 1980.