PRINCIPLES OF SPEAKER RECOGNITION
Speaker recognition can be classified into identification and verification. Speaker identification
is the process of determining which registered speaker provided a given utterance. Speaker
verification, on the other hand, is the process of accepting or rejecting the identity claim of a
speaker. Figure 1 shows the basic structures of speaker identification and verification systems.
The two technologies, identification and verification, each have their own advantages and
disadvantages and may require different treatment and techniques. The choice of which
technology to use is application-specific. At the highest level, all speaker recognition systems
contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature
extraction is the process of extracting a small amount of data from the voice signal that can later be
used to represent each speaker. Feature matching is the procedure that identifies the
unknown speaker by comparing the features extracted from his or her voice input with those from a set
of known speakers. We will discuss each module in detail in later sections.
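The two-module pipeline can be sketched in miniature. Everything below is an illustrative assumption rather than the project's actual front end: the frame length, the toy per-frame statistics used as "features", and the nearest-mean matcher are stand-ins (real systems use spectral features and richer models, as described in later sections):

```python
import numpy as np

def extract_features(signal):
    # Toy front end: split the signal into fixed-length frames and
    # summarize each frame with its mean and standard deviation.
    frame_len = 160
    n = len(signal) // frame_len
    frames = np.reshape(signal[:n * frame_len], (n, frame_len))
    return np.stack([frames.mean(axis=1), frames.std(axis=1)], axis=1)

def match(features, reference_models):
    # Return the ID of the reference model closest to the mean of the
    # input features (a stand-in for real feature matching).
    scores = {spk: np.linalg.norm(features.mean(axis=0) - model)
              for spk, model in reference_models.items()}
    return min(scores, key=scores.get)
```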
[Figure 1(a): Speaker identification — input speech → feature extraction → similarity against each reference model (Speaker #1 … Speaker #N) → maximum selection → identification result (Speaker ID)]
(b) Speaker verification
Figure 1.1 Basic structures of speaker recognition systems
All speaker recognition systems operate in two distinct phases. The first is referred to as
the enrolment or training phase, while the second is referred to as the operational or testing
phase. In the training phase, each registered speaker has to provide samples of their speech so that
the system can build or train a reference model for that speaker. In the case of speaker verification
systems, a speaker-specific threshold is also computed from the training samples. In the
testing phase, the input speech is matched against the stored reference model(s) and a recognition
decision is made.
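The two phases can be sketched as follows, under two loud simplifications: the reference model is just the mean training feature vector, and the speaker-specific threshold is derived from the spread of the training samples. Real systems use far richer models; this only illustrates the enrol-then-test structure:

```python
import numpy as np

def enroll(train_features):
    # Training phase: build a reference model (here, the mean feature
    # vector) and derive a speaker-specific acceptance threshold from
    # the spread of the training samples.
    model = train_features.mean(axis=0)
    distances = np.linalg.norm(train_features - model, axis=1)
    threshold = distances.mean() + 2 * distances.std()
    return model, threshold

def verify(test_features, model, threshold):
    # Testing phase: accept the identity claim if the test features
    # lie within the speaker-specific threshold of the model.
    distance = np.linalg.norm(test_features.mean(axis=0) - model)
    return distance <= threshold
```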
Speech signals in the training and testing sessions can differ greatly due to many factors,
such as changes in people's voices over time, health conditions (e.g. the speaker has a cold),
speaking rates, and so on. There are also other factors, beyond speaker variability, that
present a challenge to speaker recognition technology. Examples are acoustical
noise and variations in recording environments (e.g. the speaker uses different telephone
handsets).
[Figure 1(b): Speaker verification — input speech → feature extraction → similarity against the claimed speaker's reference model (Speaker #M) → decision against a speaker-specific threshold → verification result (Accept/Reject)]
SPEECH ANALYSIS
OVERVIEW
Phonetics is part of the linguistic sciences. It is concerned with the sounds produced by the
human vocal organs, and more specifically, the sounds used in human speech. One
important aspect of phonetics research is the instrumental analysis of speech. This is often referred
to as experimental phonetics or machine phonetics.
The Speech Analysis Series is a series of articles examining different aspects of
presentation analysis. It covers how to study a speech and how to deliver an effective
speech evaluation. Later articles examine Toastmasters evaluation contests and speech
evaluation forms and resources.
This document describes how to build a simple, yet complete and representative,
speaker recognition system. Such a system has potential in many security
applications. For example, users may have to speak a PIN (Personal Identification Number) in
order to gain access to a laboratory door, or speak their credit card number
over the telephone line to verify their identity. By checking the voice characteristics of the
input utterance with a speaker recognition system similar to the one described here,
the system can add an extra level of security.
NOTE: The sounds in this demo are downsampled to 8-bit, 8000 Hz. You may want to widen
your browser's window to view the images.
FEATURE EXTRACTION
OVERVIEW
The purpose of this module is to convert the speech waveform to some type of
parametric representation (at a considerably lower information rate) for further analysis and
processing. This is often referred to as the signal-processing front end. The speech signal is a
slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is
shown in Figure 2. When examined over a sufficiently short period of time (between 5 and
100 msec), its characteristics are fairly stationary. However, over longer periods of time (on
the order of 1/5 second or more) the signal characteristics change to reflect the different
speech sounds being spoken. Therefore, short-time spectral analysis is the most common
way to characterize the speech signal.
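Short-time spectral analysis can be sketched as follows. The 25 ms frame and 10 ms hop are assumed values chosen from within the 5-100 ms quasi-stationary range noted above, not parameters taken from this project:

```python
import numpy as np

def short_time_spectra(signal, fs=8000, frame_ms=25, hop_ms=10):
    # Slice the signal into short overlapping frames, window each frame,
    # and take the magnitude spectrum of each: the signal is treated as
    # stationary within a frame but not across frames.
    frame = int(fs * frame_ms / 1000)   # 200 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)       # 80 samples at 8 kHz
    window = np.hamming(frame)
    spectra = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = signal[start:start + frame] * window
        spectra.append(np.abs(np.fft.rfft(chunk)))
    return np.array(spectra)
```

Each row of the result is the spectrum of one short frame; tracking how the rows change over time is what reveals the different speech sounds being spoken.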
FEATURE MATCHING
OVERVIEW
The problem of speaker recognition belongs to a much broader topic in science and
engineering known as pattern recognition. The goal of pattern recognition is to classify
objects of interest into one of a number of categories or classes. The objects of interest are
generically called patterns, and in our case they are sequences of acoustic vectors extracted
from an input speech signal using the techniques described in the previous section. The classes here
refer to individual speakers. Since the classification procedure in our case is applied to
extracted features, it can also be referred to as feature matching.
Furthermore, if there exists some set of patterns whose individual classes are
already known, then one has a problem in supervised pattern recognition. These patterns
comprise the training set and are used to derive a classification algorithm. The remaining
patterns are then used to test the classification algorithm; these patterns are collectively
referred to as the test set. If the correct classes of the individual patterns in the test set are
also known, then one can evaluate the performance of the algorithm.
State-of-the-art feature matching techniques used in speaker recognition include
Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector
Quantization (VQ). In this project, the VQ approach is used, due to its ease of
implementation and high accuracy. VQ is a process of mapping vectors from a large vector
space to a finite number of regions in that space. Each region is called a cluster and can be
represented by its center, called a codeword. The collection of all codewords is called a
codebook.
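Codebook construction can be sketched with plain k-means clustering. This is a stand-in: the project's actual codebook training uses the clustering algorithm described in Section 4.2, and the codebook size and iteration count below are arbitrary illustrative choices:

```python
import numpy as np

def train_codebook(vectors, codebook_size=4, iterations=20, seed=0):
    # k-means clustering: each centroid is a codeword, and the set of
    # centroids forms the codebook for one speaker.
    rng = np.random.default_rng(seed)
    # Initialize codewords from randomly chosen training vectors.
    codebook = vectors[rng.choice(len(vectors), codebook_size, replace=False)]
    for _ in range(iterations):
        # Assign each training vector to its nearest codeword.
        dists = np.linalg.norm(vectors[:, None] - codebook[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the centroid of its assigned cluster.
        for k in range(codebook_size):
            members = vectors[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook
```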
Figure 4.1 shows a conceptual diagram illustrating this recognition process. In the
figure, only two speakers and two dimensions of the acoustic space are shown. The circles
refer to the acoustic vectors from speaker 1, while the triangles are from speaker 2. In
the training phase, using the clustering algorithm described in Section 4.2, a speaker-specific
VQ codebook is generated for each known speaker by clustering his or her training acoustic
vectors. The resulting codewords (centroids) are shown in Figure 5 by black circles and black
triangles for speakers 1 and 2, respectively. The distance from a vector to the closest
codeword of a codebook is called the VQ distortion. In the recognition phase, an input
utterance of an unknown voice is "vector-quantized" using each trained codebook and the
total VQ distortion is computed. The speaker corresponding to the VQ codebook with the
smallest total distortion is identified as the speaker of the input utterance.
