DSP Mini-Project: An Automatic Speaker Recognition System
http://www.ifp.uiuc.edu/~minhdo/teaching/speaker_recognition
1 Overview
Speaker recognition is the process of automatically recognizing who is speaking on
the basis of individual information included in speech waves. This technique makes it
possible to use the speaker's voice to verify their identity and control access to services
such as voice dialing, banking by telephone, telephone shopping, database access
services, information services, voice mail, security control for confidential information
areas, and remote access to computers.
This document describes how to build a simple, yet complete and representative
automatic speaker recognition system. Such a speaker recognition system has potential
in many security applications. For example, users have to speak a PIN (Personal
Identification Number) in order to gain access to the laboratory door, or users have to
speak their credit card number over the telephone line to verify their identity. By
checking the voice characteristics of the input utterance, using an automatic speaker
recognition system similar to the one that we will describe, the system is able to add an
extra level of security.
2 Principles of Speaker Recognition
Speaker recognition can be classified into identification and verification. Speaker
identification is the process of determining which registered speaker provides a given
utterance. Speaker verification, on the other hand, is the process of accepting or rejecting
the identity claim of a speaker. Figure 1 shows the basic structures of speaker
identification and verification systems. The system that we will describe is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.
At the highest level, all speaker recognition systems contain two main modules (refer
to Figure 1): feature extraction and feature matching. Feature extraction is the process
that extracts a small amount of data from the voice signal that can later be used to
represent each speaker. Feature matching involves the actual procedure to identify the
unknown speaker by comparing extracted features from his/her voice input with the ones
from a set of known speakers. We will discuss each module in detail in later sections.
(a) Speaker identification: input speech → feature extraction → similarity against each reference model (Speaker #1, ..., Speaker #N) → maximum selection → identification result (speaker ID)
(b) Speaker verification: input speech → feature extraction → similarity against the reference model of the claimed speaker (Speaker #M) → decision against a speaker-specific threshold → verification result (accept/reject)
Figure 1. Basic structures of speaker recognition systems
All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrolment or training phase, while the second is referred to as the operational or testing phase. In the training phase, each registered speaker has to provide
samples of their speech so that the system can build or train a reference model for that
speaker. In case of speaker verification systems, in addition, a speaker-specific threshold
is also computed from the training samples. In the testing phase, the input speech is
matched with stored reference model(s) and a recognition decision is made.
Speaker recognition is a difficult task. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself/herself: speech signals in training and testing sessions can differ greatly due to many factors, such as a person's voice changing over time, health conditions (e.g. the speaker has a cold), speaking rates, and so on. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).
3 Speech Feature Extraction
3.1 Introduction
The purpose of this module is to convert the speech waveform, using digital signal processing (DSP) tools, to a set of features (at a considerably lower information rate) for further analysis. This is often referred to as the signal-processing front end.
The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
[Plot: a speech waveform, amplitude between -0.5 and 0.5, versus time from 0 to 0.018 second]
Figure 2. Example of speech signal
A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and will be described in this paper.
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.
3.2 Mel-frequency cepstrum coefficients processor
A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to the variations mentioned above than the speech waveforms themselves.
continuous speech → Frame Blocking → frame → Windowing → FFT → spectrum → Mel-frequency Wrapping → mel spectrum → Cepstrum → mel cepstrum
Figure 3. Block diagram of the MFCC processor
3.2.1 Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with
adjacent frames being separated by M (M < N). The first frame consists of the first N
samples. The second frame begins M samples after the first frame, and overlaps it by N -
M samples and so on. This process continues until all the speech is accounted for within
one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec of windowing and facilitates the fast radix-2 FFT) and M = 100.
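To make this step concrete, here is a minimal Matlab sketch of frame blocking (the variable names s, N, M, and frames are our own illustration, not part of the supplied project code):

    % Block a speech signal s (a column vector) into overlapping frames.
    N = 256;                                    % frame length in samples
    M = 100;                                    % frame increment (M < N)
    numFrames = floor((length(s) - N) / M) + 1; % number of complete frames
    frames = zeros(N, numFrames);
    for i = 1:numFrames
        first = (i - 1) * M + 1;                % first sample of frame i
        frames(:, i) = s(first : first + N - 1);
    end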
3.2.2 Windowing
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, then the result of windowing is the signal

    $y_l(n) = x_l(n) \, w(n), \qquad 0 \le n \le N-1$

Typically the Hamming window is used, which has the form:

    $w(n) = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right), \qquad 0 \le n \le N-1$
3.2.3 Fast Fourier Transform (FFT)
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

    $X_k = \sum_{n=0}^{N-1} x_n \, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1$

In general the X_k are complex numbers and we only consider their absolute values (frequency magnitudes). The resulting sequence {X_k} is interpreted as follows: positive frequencies 0 ≤ f < F_s/2 correspond to values n = 0, 1, ..., N/2 − 1, while negative frequencies −F_s/2 < f < 0 correspond to n = N/2 + 1, ..., N − 1. Here, F_s denotes the sampling frequency.
The result after this step is often referred to as the spectrum or periodogram.
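Continuing the illustrative sketch from Section 3.2.1 (again, the variable names are our own assumptions), the windowing and FFT steps can be combined as:

    % Window each frame with a Hamming window, then take its FFT.
    w = hamming(N);                        % built-in N-point Hamming window
    spectra = zeros(N, numFrames);
    for i = 1:numFrames
        y = frames(:, i) .* w;             % taper the frame at both ends
        spectra(:, i) = abs(fft(y));       % magnitude spectrum of frame i
    end
    % Rows 1 to N/2 of spectra correspond to frequencies 0 <= f < Fs/2.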
3.2.4 Mel-frequency Wrapping
As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
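The document does not give an explicit formula for this scale, but a commonly used approximation (an assumption on our part, not taken from this text) is

    $\mathrm{mel}(f) = 2595 \, \log_{10}\!\left( 1 + \frac{f}{700} \right)$

which is approximately linear below 1000 Hz and logarithmic above it.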
[Plot: a mel-spaced filterbank of overlapping triangular filters, with gain from 0 to 2 over frequencies from 0 to 7000 Hz]
Figure 4. An example of mel-spaced filterbank
One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale (see Figure 4). The filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The number of mel spectrum coefficients, K, is typically chosen as 20. Note that this filter bank is applied in the frequency domain, so it simply amounts to applying the triangle-shaped windows of Figure 4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.
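As a sketch of how this step might look in Matlab, assuming the supplied melfb(p, n, fs) returns a p-by-(1+floor(n/2)) matrix of triangular filter weights (check help melfb; the exact signature here is our assumption):

    % Warp each power spectrum onto K mel-spaced triangular filters.
    K = 20;                                     % number of mel filters
    m = melfb(K, N, fs);                        % filter weights (assumed shape)
    half = 1 + floor(N / 2);                    % non-negative frequency bins
    melSpec = m * (spectra(1:half, :) .^ 2);    % mel-warped power spectrum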
3.2.5 Cepstrum
In this final step, we convert the log mel spectrum back to time. The result is called
the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the
speech spectrum provides a good representation of the local spectral properties of the
signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that result from the last step as S̃_k, k = 1, 2, ..., K, we can calculate the MFCCs, c̃_n, as

    $\tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k) \cos\!\left[ n \left( k - \tfrac{1}{2} \right) \tfrac{\pi}{K} \right], \qquad n = 0, 1, \ldots, K-1$

Note that we exclude the first component, c̃_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
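The final step of the sketch could then use Matlab's built-in dct (note that dct computes an orthonormalized DCT-II, which differs from the formula above only by constant scale factors):

    % Convert the log mel spectrum into mel-frequency cepstrum coefficients.
    c = dct(log(melSpec));     % DCT of each column (one column per frame)
    c(1, :) = [];              % drop the first coefficient: it reflects the
                               % mean log energy, carrying little speaker info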
3.3 Summary
By applying the procedure described above, a set of mel-frequency cepstrum coefficients is computed for each speech frame of around 30 msec (with overlap). These coefficients are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Therefore each input utterance is transformed into a sequence of acoustic vectors. In the next section we will see how those acoustic vectors can be used to represent and recognize the voice characteristics of the speaker.
4 Feature Matching
4.1 Overview
The problem of speaker recognition belongs to the much broader topic of pattern recognition in science and engineering. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case they are the sequences of acoustic vectors extracted from input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching.
Furthermore, if there exists some set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. These
patterns comprise the training set and are used to derive a classification algorithm. The
remaining patterns are then used to test the classification algorithm; these patterns are
collectively referred to as the test set. If the correct classes of the individual patterns in
the test set are also known, then one can evaluate the performance of the algorithm.
The state of the art in feature matching techniques used in speaker recognition includes Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this project, the VQ approach will be used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook.
Figure 5 shows a conceptual diagram that illustrates this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles are acoustic vectors from speaker 1, while the triangles are from speaker 2. In the training phase, using the clustering algorithm described in Section 4.2, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 5 as black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified as the speaker of the input utterance.
[Diagram: acoustic vectors of Speaker 1 (circles) and Speaker 2 (triangles) in a two-dimensional acoustic space, together with their VQ codewords (centroids); the VQ distortion is the distance from a sample vector to the closest centroid of a codebook]
Figure 5. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids. (Adapted from Song et al., 1987)
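To make the recognition rule concrete, here is a hedged Matlab sketch (the names code and v are our own; we assume the supplied disteu(x, y) returns the matrix of pairwise Euclidean distances between the columns of x and the columns of y):

    % Pick the registered speaker whose codebook yields the smallest
    % total VQ distortion for the unknown utterance.
    % code{k}: codebook of speaker k (codewords as columns)
    % v:       acoustic vectors of the unknown utterance (as columns)
    best = inf; speakerID = 0;
    for k = 1:length(code)
        d = disteu(v, code{k});         % distances: vectors x codewords
        dist = sum(min(d, [], 2));      % match each vector to nearest codeword
        if dist < best
            best = dist; speakerID = k;
        end
    end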
4.2 Clustering the Training Vectors
After the enrolment session, the acoustic vectors extracted from the input speech of each speaker provide a set of training vectors for that speaker. As described above, the next important step is to build a speaker-specific VQ codebook for each speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codebook entry y_n according to the rule

    $y_n^{+} = y_n (1 + \varepsilon), \qquad y_n^{-} = y_n (1 - \varepsilon)$

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current
codebook that is closest (in terms of similarity measurement), and assign that vector
to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training
vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first
by designing a 1-vector codebook, then uses a splitting technique on the codewords to
initialize the search for a 2-vector codebook, and continues the splitting process until the
desired M-vector codebook is obtained.
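As a concrete illustration of these stages, a compact Matlab sketch of the LBG procedure follows (a possible starting point for the vqlbg function of Section 5.3, not the official solution; it assumes M is a power of two and that disteu behaves as described above):

    function r = vqlbg(d, M)
    % Train a VQ codebook r (codewords as columns) from the training
    % vectors d (one acoustic vector per column) using LBG splitting.
    epsilon = 0.01;                     % splitting / convergence parameter
    r = mean(d, 2);                     % step 1: centroid of all vectors
    while size(r, 2) < M
        r = [r*(1+epsilon), r*(1-epsilon)];   % step 2: split each centroid
        Dprev = Inf;
        while true
            z = disteu(d, r);                 % distances: vectors x codewords
            [dmin, ind] = min(z, [], 2);      % step 3: nearest-neighbor search
            for k = 1:size(r, 2)              % step 4: centroid update
                if any(ind == k)
                    r(:, k) = mean(d(:, ind == k), 2);
                end
            end
            D = sum(dmin) / size(d, 2);       % average distortion
            if (Dprev - D) / D < epsilon      % step 5: converged?
                break
            end
            Dprev = D;
        end
    end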
Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to a cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.
[Flow diagram: Find centroid → (if m < M) Split each centroid, m = 2m → Cluster vectors → Find centroids → Compute D (distortion); if (D' − D)/D ≥ ε, set D' = D and return to "Cluster vectors"; otherwise, if m < M, return to "Split each centroid", else stop.]
Figure 6. Flow diagram of the LBG algorithm (Adapted from Rabiner and Juang, 1993)
5 Project
As stated before, in this project we will experiment with building and testing an automatic speaker recognition system. In order to build such a system, one has to go through the steps that were described in the previous sections. The most convenient platform for this is the Matlab environment, since many of the above tasks have already been implemented in Matlab. The project Web page given at the beginning provides a test database and several helper functions to ease the development process. We supply you with two utility functions, melfb and disteu, and two main functions, train and test. Download all of these files from the project Web page into your working folder. The first two files can be treated as black boxes, but the latter two need to be thoroughly understood. In fact, your tasks are to write the two missing functions, mfcc and vqlbg, which will be called from the given main functions. In order to accomplish that, follow each step in this section carefully and check your understanding by answering all the questions.
5.1 Speech Data
Download the ZIP file of the speech database from the project Web page. After unzipping the file, you will find two folders, TRAIN and TEST, each containing 8 files named S1.WAV, S2.WAV, ..., S8.WAV; each file is labeled with the ID of the speaker. These files were recorded in Microsoft WAV format. On Windows systems, you can listen to the recorded sounds by double-clicking on the files.
Our goal is to train a voice model (or, more specifically, a VQ codebook in the MFCC vector space) for each speaker S1-S8 using the corresponding sound file in the TRAIN folder. After this training step, the system has knowledge of the voice characteristics of each (known) speaker. Next, in the testing phase, the system will be able to identify the (assumed unknown) speaker of each sound file in the TEST folder.
Question 1: Play each sound file in the TRAIN folder. Can you distinguish the voices of the eight speakers in the database? Now play each sound in the TEST folder in a random order without looking at the file names (pretending that you do not know the speakers) and try to identify each speaker using the knowledge of their voices that you just gained from the TRAIN folder. This is exactly what the computer will do in our system. What is your (human) recognition rate? Record this result so that it can later be compared against the computer performance of our system.
5.2 Speech Processing
In this phase you are required to write a Matlab function that reads a sound file and turns it into a sequence of MFCCs (acoustic vectors) using the speech processing steps described previously. Many of those tasks are already provided by either standard or supplied Matlab functions. The Matlab functions that you will need are: wavread, hamming, fft, dct and melfb (supplied function). Type help function_name at the Matlab prompt for more information about these functions.
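For example, reading and playing back a training file might look like this (the file path is illustrative; wavread is the classic Matlab WAV reader this project assumes):

    [s, fs] = wavread('TRAIN/S1.WAV');   % signal samples and sampling rate
    sound(s, fs);                        % play back at the correct rate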
Question 2: Read a sound file into Matlab. Check it by playing the sound file in Matlab
using the function: sound. What is the sampling rate? What is the highest frequency
that the recorded sound can capture with fidelity? With that sampling rate, how many
msecs of actual speech are contained in a block of 256 samples?
Plot the signal to view it in the time domain. It should be obvious that the raw signal in the time domain contains a very large amount of data and is difficult to analyze directly for voice characteristics. The motivation for this step (speech feature extraction) should therefore be clear now!
Now cut the speech signal (a vector) into frames with overlap (refer to the framing section in the theory part). The result is a matrix where each column is a frame of N samples from the original speech signal. Then apply the "Windowing" and "FFT" steps to transform the signal into the frequency domain. This process is used in many different applications and is referred to in the literature as the Windowed Fourier Transform (WFT) or Short-Time Fourier Transform (STFT). The result is often called the spectrum or periodogram.
Question 3: After successfully running the preceding process, what is the
interpretation of the result? Compute the power spectrum and plot it out using the
imagesc command. Note that it is better to view the power spectrum on the log scale.
Locate the region in the plot that contains most of the energy. Translate this location
into the actual ranges in time (msec) and frequency (in Hz) of the input speech signal.
Question 4: Compute and plot the power spectrum of a speech file using different frame sizes: for example, N = 128, 256 and 512. In each case, set the frame increment M to be about N/3. Can you describe and explain the differences among those spectra?
The last step in speech processing is converting the power spectrum into mel-
frequency cepstrum coefficients. The supplied function melfb facilitates this task.
Question 5: Type help melfb at the Matlab prompt for more information about
this function. Follow the guidelines to plot out the mel-spaced filter bank. What is the
behavior of this filter bank? Compare it with the theoretical part.
Question 6: Compute and plot the spectrum of a speech file before and after the mel-
frequency wrapping step. Describe and explain the impact of the melfb program.
Finally, complete the "Cepstrum" step and put all the pieces together into a single Matlab function, mfcc, which performs the MFCC processing.
5.3 Vector Quantization
The result of the last section is that we can transform speech signals into vectors in an acoustic space. In this section, we will apply the VQ-based pattern recognition technique to build speaker reference models from those vectors in the training phase, and then use them to identify sequences of acoustic vectors uttered by unknown speakers.
Question 7: To inspect the acoustic space (MFCC vectors) we can pick any two
dimensions (say the 5th and the 6th) and plot the data points in a 2D plane. Use acoustic
vectors of two different speakers and plot data points in two different colors. Do the data
regions from the two speakers overlap each other? Are they in clusters?
Now write a Matlab function, vqlbg, that trains a VQ codebook using the LBG algorithm described before. Use the supplied utility function disteu to compute the pairwise Euclidean distances between the codewords and training vectors in the iterative process.
Question 8: Plot the VQ codewords produced by vqlbg, using the same two dimensions, over the plot of the previous question. Compare the result with Figure 5.
5.4 Simulation and Evaluation
Now for the final part! Use the two supplied programs, train and test (which require the two functions mfcc and vqlbg that you just completed), to simulate the training and testing procedures of the speaker recognition system, respectively.
Question 9: What recognition rate does our system achieve? Compare this with the human performance. For the cases where the system makes errors, listen to the speech files again and try to come up with some explanations.
Question 10: You can also test the system with your own speech files. Use the Windows program Sound Recorder to record more voices from yourself and your friends. Each new speaker needs to provide one speech file for training and one for testing. Can the system recognize your voice? Enjoy!
REFERENCES
[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.
[3] S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-28, No. 4, August 1980.
[4] Y. Linde, A. Buzo and R. Gray, "An algorithm for vector quantizer design", IEEE Transactions on Communications, Vol. 28, pp. 84-95, 1980.
[5] S. Furui, "Speaker independent isolated word recognition using dynamic features of speech spectrum", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 1, pp. 52-59, February 1986.
[6] S. Furui, "An overview of speaker recognition technology", ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994.
[7] F.K. Song, A.E. Rosenberg and B.H. Juang, "A vector quantisation approach to speaker recognition", AT&T Technical Journal, Vol. 66-2, pp. 14-26, March 1987.
[8] comp.speech Frequently Asked Questions WWW site, http://svr-www.eng.cam.ac.uk/comp.speech/