This document discusses speech recognition techniques. It begins by defining biometrics and how speech can be used as a biometric for identity authentication. It describes how speech recognition aims to extract lexical information independently of the speaker, while speaker recognition focuses on extracting the identity of the speaker. The document then discusses feature extraction using MFCC and modeling speech using neural networks. It provides an overview of pattern recognition techniques including statistical and structural approaches. Finally, it discusses implementation details such as preprocessing, framing, windowing and feature extraction of speech signals.
Chapter 2 | Speech Recognition
2.1 | INTRODUCTION
Biometrics is, in the simplest definition, something you are: a physical
characteristic unique to each individual, such as a fingerprint, retina, iris, or voice.
Biometrics has a very useful application in security; it can be used to authenticate a
person's identity and control access to a restricted area, based on the premise that
this set of physical characteristics can be used to uniquely identify an individual.
A speech signal conveys two important types of information: primarily the speech
content and, on a secondary level, the speaker identity. Speech recognizers aim to
extract the lexical information from the speech signal independently of the speaker
by reducing inter-speaker variability. Speaker recognition, on the other hand, is
concerned with extracting the identity of the person speaking the utterance. Both
speech recognition and speaker recognition are therefore possible from the same
voice input.
In our project we use speech recognition because we want the stick to recognize a
spoken word and perform an action depending on that word.
Mel-Frequency Cepstral Coefficients (MFCC) are used as features for both speech
and speaker recognition. We also combined energy features and the delta and
delta-delta features of energy and MFCC. After calculating the features, neural
networks are used to model the speech. Based on the speech model, the system
decides whether or not the uttered speech matches what the user was prompted to utter.
2.2 | LITERATURE REVIEW
2.2.1 | Pattern Recognition
Pattern recognition, a branch of artificial intelligence and a sub-field of
machine learning, is the study of how machines can observe the environment, learn
to distinguish patterns of interest from their background, and make sound and
reasonable decisions about the categories of the patterns. A pattern can be a
fingerprint image, a handwritten cursive word, a human face, a speech signal,
a sales pattern, etc.
The applications of pattern recognition include data mining, document
classification, financial forecasting, organization and retrieval of multimedia
databases, and biometrics (personal identification based on various physical
attributes such as the face, retina, speech, ear and fingerprints). The essential steps of
pattern recognition are: Data Acquisition, Preprocessing, Feature Extraction,
Training and Classification.
Features are used to denote the descriptor. Features must be selected so that they
are discriminative and invariant. They can be represented as a vector, matrix, tree,
graph, or string.
They are ideally similar for objects in the same class and very different for objects
in different classes. A pattern class is a family of patterns that share some common
properties. Pattern recognition by machine involves techniques for assigning
patterns to their respective classes automatically and with as little human
intervention as possible.
Learning and classification usually use one of the following approaches. Statistical
pattern recognition is based on statistical characterizations of patterns, assuming
that the patterns are generated by a probabilistic system. Syntactical (or structural)
pattern recognition is based on the structural interrelationships of features. Given a
pattern, its recognition/classification may consist of one of the following two
tasks according to the type of learning procedure:
1) Supervised classification (e.g., discriminant analysis), in which the input pattern is
identified as a member of a predefined class.
2) Unsupervised classification (e.g., clustering), in which the pattern is assigned to
a previously unknown class.
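As a minimal illustration of the two learning procedures, the sketch below (Python with NumPy; the function names are ours, not from the project) classifies toy feature vectors with a supervised nearest-centroid rule and clusters the same vectors with unsupervised k-means:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Supervised: learn one centroid per labelled (predefined) class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

def kmeans(X, k, iters=20, seed=0):
    """Unsupervised: discover k clusters without any labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.array([np.argmin([np.linalg.norm(x - c) for c in centers])
                           for x in X])
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Toy 2-D feature vectors: two well-separated pattern classes.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
```

The supervised rule needs the labels `y` at training time; k-means recovers the same grouping from `X` alone.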
Fig. (2.1): General block diagram of pattern recognition system
2.2.2 | Generation of Voice
Speech begins with the generation of an airstream, usually by the lungs and
diaphragm -process called initiation. This air then passes through the larynx tube,
where it is modulatedby the glottis (vocal chords). This step is called phonation or
voicing, and is responsible fourth generation of pitch and tone. Finally, the
modulated air is filtered by the mouth, nose, and throat - a process called
articulation - and the resultant pressure wave excites the air.
Fig.(2.2): Vocal Schematic
Depending upon the positions of the various articulators, different sounds are
produced. The configuration of the articulators can be modeled by a linear
time-invariant system whose frequency response is characterized by several peaks
called formants. The change in the frequencies of the formants characterizes the
phoneme being articulated.
As a consequence of this physiology, we can notice several characteristics of the
frequency-domain spectrum of speech. First of all, the oscillation of the glottis
results in an underlying fundamental frequency and a series of harmonics at
multiples of this fundamental. This is shown in the figure below, where we have
plotted a brief audio waveform for the phoneme /i:/ and its magnitude spectrum.
The fundamental frequency (180 Hz) and its harmonics appear as spikes in the
spectrum. The location of the fundamental frequency is speaker dependent, and is a
function of the dimensions and tension of the vocal cords. For adults it usually
falls between 100 Hz and 250 Hz, and the female average is significantly higher
than that of males.
Fig. (2.3): Audio Sample for /i: / phoneme showing stationary property of phonemes for a short period
The sound comes out in phonemes, which are the building blocks of speech. Each
phoneme resonates at a fundamental frequency and its harmonics and thus has
high energy at those frequencies; in other words, each phoneme has different
formants. This is the feature that enables the identification of each phoneme at the
recognition stage.
Fig.(2.4): Audio Magnitude Spectrum for /i:/ phoneme showing fundamental frequency and its harmonics
The inter-speaker variations in the features of the speech signal during the
utterance of a word are modeled in word training for speech recognition. For
speaker recognition, the intra-speaker variations in features over long speech
content are modeled.
Besides the configuration of the articulators, the acoustic manifestation of a
phoneme is affected by:
Physiology and emotional state of speaker.
Phonetic context.
Accent.
2.2.3 | Voice as Biometric
The underlying premise for voice authentication is that each person's voice differs
in pitch, tone, and volume enough to make it uniquely distinguishable. Several
factors contribute to this uniqueness: the size and shape of the mouth, throat, nose,
and teeth (articulators) and the size, shape, and tension of the vocal cords. The
chance that all of these are exactly the same in any two people is very low. Voice
biometrics has the following advantages over other forms of biometrics:
A natural signal to produce.
Low implementation cost, since it does not require a specialized input device.
Acceptable to users.
Easily combined with other forms of authentication for multifactor authentication.
The only biometric that allows users to authenticate remotely.
2.2.4 | Speech Recognition
Speech is the dominant means of communication between humans, and promises
to be important for communication between humans and machines, if it can just be
made a little more reliable.
Speech recognition is the process of converting an acoustic signal to a set of words.
The applications include voice command and control, data entry, voice user
interfaces, automating the telephone operator's job in telephony, etc. It can also
serve as the input to natural language processing. There are two variants of speech
recognition based on the duration of the speech signal:
isolated word recognition, in which each word is surrounded by some sort of
pause, is much easier than recognizing continuous speech, in which words run into
each other and have to be segmented. Speech recognition is a difficult task because
of the many sources of variability associated with the signal. First, the acoustic
realizations of phonemes, the smallest sound units of which words are composed,
are highly dependent on the context. Second, acoustic variability can result from
changes in the environment as well as in the position and characteristics of the
transducer. Third, within-speaker variability can result from changes in the
speaker's physical and emotional state, speaking rate, or voice quality. Finally,
differences in sociolinguistic background, dialect, and vocal tract size and shape
can contribute to cross-speaker variability. Such variability is modeled in various
ways. At the level of signal representation, a representation that emphasizes the
speaker-independent features is developed.
2.2.5 | Speaker Recognition
Speaker recognition is the process of automatically recognizing who is speaking on
the basis of individual information included in speech waves. Speaker recognition
can be classified into identification and verification. Speaker recognition has been
applied most often as a means of biometric authentication.
2.2.5.1 | Types of Speaker Recognition
Speaker Identification
Speaker identification is the process of determining which registered speaker
provides a given utterance. In a Speaker Identification (SID) system, no identity
claim is provided; the test utterance is scored against a set of known (registered)
references for each potential speaker, and the one whose model best matches the
test utterance is selected. There are two types of speaker identification task:
closed-set and open-set speaker identification. In closed-set identification, the test
utterance belongs to one of the registered speakers.
During testing, a matching score is estimated for each registered speaker. The
speaker corresponding to the model with the best matching score is selected. This
requires N comparisons for a population of N speakers. In open-set identification,
any speaker can access the system; those who are not registered should be
rejected. This requires another model, referred to as a garbage model, imposter
model or background model, which is trained with data provided by speakers
other than the registered speakers.
During testing, the matching score corresponding to the best speaker model is
compared with the matching score estimated using the garbage model in order
to accept or reject the speaker, making the total number of comparisons equal to N
+ 1. Speaker identification performance tends to decrease as the population size
increases.
Speaker verification
Speaker verification, on the other hand, is the process of accepting or rejecting the
identity claim of a speaker. That is, the goal is to automatically accept or reject an
identity that is claimed by the speaker. During testing, a verification score is
estimated using the claimed speaker model and the anti-speaker model. This
verification score is then compared to a threshold. If the score is higher than the
threshold, the speaker is accepted; otherwise, the speaker is rejected.
Thus, speaker verification involves a hypothesis test requiring a simple binary
decision: accept or reject the claimed identity, regardless of the population size.
Hence, the performance is quite independent of the population size, but it depends
on the number of test utterances used to evaluate the performance of the system.
2.2.6 | Speaker/Speech Modeling
There are various pattern modeling/matching techniques. They include Dynamic
Time Warping (DTW), Gaussian Mixture Models (GMM), Hidden Markov
Models (HMM), Artificial Neural Networks (ANN), and Vector Quantization
(VQ). These are used interchangeably for speech and speaker modeling. A common
statistical approach for speaker recognition is the GMM, which models the
variations in the features of a speaker over a long sequence of utterances. Another
statistical method, widely used for speech recognition, is the HMM. The HMM
models the Markovian nature of the speech signal, where each phoneme represents
a state and a sequence of such phonemes represents a word. The sequences of
features of such phonemes from different speakers are modeled by the HMM.
2.3 | IMPLEMENTATION DETAILS
The implementation of the system includes a common pre-processing and feature
extraction module, speaker-independent speech modeling, and classification by
ANNs.
2.3.1 | Pre-Processing and Feature Extraction
Starting from the capture of the audio signal, feature extraction consists of the
following steps, shown in the block diagram below:
Speech Signal → Silence Removal → Pre-emphasis → Framing → Windowing →
DFT → Mel Filter Bank → Log → IDFT → CMS, followed by energy and delta
computation, yielding 12 MFCC, 12 ΔMFCC, 12 ΔΔMFCC, 1 energy, 1 Δenergy
and 1 ΔΔenergy coefficients per frame.
Fig.(2.5): Pre-Processing and Feature Extraction
2.3.1.1 | Capture
The first step in processing speech is to convert the analog representation (first air
pressure, and then an analog electrical signal in a microphone) into a digital signal
x[n], where n is an index over time. Analysis of the audio spectrum shows that
nearly all the energy resides in the band between DC and 4 kHz, and beyond 10 kHz
there is virtually no energy whatsoever.
Sound format used:
22050 Hz
16-bits, Signed
Little Endian
Mono Channel
Uncompressed PCM
2.3.1.2 | End point detection and Silence removal
The captured audio signal may contain silence at different positions, such as the
beginning of the signal, between the words of a sentence, or the end of the signal.
If silent frames are included, modeling resources are spent on parts of the signal
which do not contribute to the identification. The silence present must be removed
before further processing. There are several ways of doing this: the most popular
are Short-Time Energy and Zero Crossing Rate, but they have their own limitations
regarding setting thresholds on an ad hoc basis. The algorithm we used uses
statistical properties of the background noise as well as physiological aspects of
speech production, and does not assume any ad hoc threshold.
It assumes that the background noise present in the utterances is Gaussian in
nature. Usually the first 200 ms or more (we used 4410 samples at the sampling
rate of 22050 samples/sec) of a speech recording corresponds to silence (or
background noise), because the speaker takes some time to start reading when
recording starts.
Endpoint Detection Algorithm:
Step 1:
Calculate the mean (μ) and standard deviation (σ) of the first 200 ms of samples of
the given utterance. The background noise is characterized by this μ and σ.
Step 2:
Go from the first sample to the last sample of the speech recording. For each
sample, check whether the one-dimensional Mahalanobis distance, |x − μ| / σ, is
greater than 3 or not. If it is greater than 3, the sample is treated as a voiced sample;
otherwise it is unvoiced/silence. This threshold rejects background-noise samples
with 99.7% confidence, since P[|x − μ| ≤ 3σ] = 0.997 for a Gaussian distribution,
thus accepting only the voiced samples.
Step 3:
Mark each voiced sample as 1 and each unvoiced sample as 0. Divide the whole
speech signal into 10 ms non-overlapping windows. Represent the complete speech
by only zeros and ones.
Step 4:
Suppose there are M zeros and N ones in a window. If M ≥ N, convert each of the
ones to zeros, and vice versa, so that each window is labeled entirely by its
majority. This smoothing is adopted keeping in mind that the speech production
system, consisting of the vocal cords, tongue, vocal tract, etc., cannot change
abruptly within a short time window, taken here as 10 ms.
Step 5:
Collect the voiced part only, according to the samples labeled '1' in the windowed
array, and dump it in a new array. Retrieve the voiced part of the original speech
signal from the samples labeled 1.
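Steps 1-5 above can be sketched in Python/NumPy as follows (a simplified illustration, not the project's exact code):

```python
import numpy as np

def remove_silence(x, fs=22050, noise_len=4410, win_ms=10):
    """Endpoint detection following Steps 1-5 above."""
    # Step 1: background-noise statistics from the first ~200 ms.
    mu, sigma = x[:noise_len].mean(), x[:noise_len].std()
    # Step 2: one-dimensional Mahalanobis distance test with threshold 3.
    voiced = np.abs(x - mu) / sigma > 3        # Step 3: voiced=1, unvoiced=0
    # Step 4: majority vote inside non-overlapping 10 ms windows.
    w = int(fs * win_ms / 1000)
    for start in range(0, len(x), w):
        seg = voiced[start:start + w]
        voiced[start:start + w] = seg.sum() > len(seg) / 2
    # Step 5: keep only the voiced part of the original signal.
    return x[voiced]
```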
Fig. (2.6): Input signal to End-point detection system
Fig. (2.7): Output signal from End point Detection System
2.3.1.3 | PCM Normalization
The extracted pulse-code-modulated amplitude values are normalized, to avoid
amplitude variations introduced during capture.
2.3.1.4 | Pre-emphasis
Usually the speech signal is pre-emphasized before any further processing. If we
look at the spectrum of voiced segments like vowels, there is more energy at
lower frequencies than at higher frequencies. This drop in energy across
frequencies is caused by the nature of the glottal pulse. Boosting the high-frequency
energy makes information from the higher formants more available to
the acoustic model and improves phone detection accuracy. The pre-emphasis filter
is a first-order high-pass filter. In the time domain, with input x[n] and 0.9 ≤ α ≤ 1.0,
the filter equation is:
y[n] = x[n] − α x[n−1]
We used α = 0.95.
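The filter is a one-liner in NumPy (a sketch; passing the first sample through unchanged is one common convention):

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```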
Fig. (2.8): Signal before Pre-Emphasis
Fig.(2.9): Signal after Pre-Emphasis
2.3.1.5 | Framing and windowing
Speech is a non-stationary signal, meaning that its statistical properties are not
constant across time. Instead, we want to extract spectral features from a
small window of speech that characterizes a particular subphone and for which we
can make the (rough) assumption that the signal is stationary (i.e. its statistical
properties are constant within this region). We used frame blocks of 23.22 ms with
50% overlap, i.e., 512 samples per frame.
Fig.(2.10): Frame Blocking of the Signal
The rectangular window (i.e., no window) can cause problems when we do
Fourier analysis; it abruptly cuts off the signal at its boundaries. A good window
function has a narrow main lobe and low side-lobe levels in its transfer function,
and shrinks the values of the signal toward zero at the window boundaries,
avoiding discontinuities. The most commonly used window function in speech
processing is the Hamming window, defined as follows:
Fig.(2.11): Hamming window
The extraction of the signal takes place by multiplying the value of the signal at
time n, Sframe[n], by the value of the window at time n, Sw[n]:
Y[n] = Sw[n] × Sframe[n]
Fig.(2.12): A single frame before and after windowing
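Framing and Hamming windowing together can be sketched as follows (illustrative NumPy code assuming the 512-sample frames with 50% overlap described above):

```python
import numpy as np

def frame_and_window(x, frame_len=512, overlap=0.5):
    """Split the signal into 50%-overlapping frames and apply a Hamming window."""
    hop = int(frame_len * (1 - overlap))          # 256-sample hop
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)                # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * window                        # taper each frame toward zero
```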
2.3.1.6 | Discrete Fourier Transform
A Discrete Fourier Transform (DFT) of the windowed signal is used to extract
the frequency content (the spectrum) of the current frame. The tool for extracting
spectral information, i.e., how much energy the signal contains at discrete
frequency bands, for a discrete-time (sampled) signal is the Discrete Fourier
Transform, or DFT. The input to the DFT is a windowed signal x[n]...x[m], and the
output, for each of N discrete frequency bands, is a complex number X[k]
representing the magnitude and phase of that frequency component in the original
signal.
The commonly used algorithm for computing the DFT is the Fast Fourier
Transform, or FFT for short.
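In NumPy, the magnitude spectrum of one windowed frame is obtained with the real-input FFT (a sketch; the 1 kHz test tone stands in for real speech):

```python
import numpy as np

fs, n = 22050, 512
t = np.arange(n) / fs
frame = np.hamming(n) * np.sin(2 * np.pi * 1000 * t)  # windowed 1 kHz tone
X = np.fft.rfft(frame)          # complex spectrum: 257 bins for a 512-point frame
magnitude = np.abs(X)           # energy at each discrete frequency band
peak_hz = np.argmax(magnitude) * fs / n               # bin index -> frequency
```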
2.3.1.7 | Mel Filter
For calculating the MFCC, a transformation to the Mel scale is first applied
according to the following formula:
mel(x) = 2595 · log10(1 + x / 700)
where x is the linear frequency. Then, a filter bank is applied to the amplitude of
the Mel-scaled spectrum. The Mel frequency warping is most conveniently done
by utilizing a filter bank with filters centered according to Mel frequencies. The
width of the triangular filters varies according to the Mel scale, so that the log total
energy in a critical band around the center frequency is included. The centers of the
filters are uniformly spaced on the Mel scale.
Fig.(2.13): Equally spaced Mel values
The result of the Mel filter bank is information about the distribution of energy in
each Mel-scale band: we obtain a vector of outputs, one per filter.
Fig.(2.14): Triangular filter bank in frequency scale
We have used 30 filters in the filter bank.
2.3.1.8 | Cepstrum by Inverse Discrete Fourier Transform
The cepstral transform is applied to the filter outputs in order to obtain the MFCC
features of each frame. The triangular filter outputs Y(i), i = 1, 2, …, M are
compressed using the logarithm, and a discrete cosine transform (DCT) is applied.
Here, M is equal to the number of filters in the filter bank, i.e., 30.
C[n] = Σ_{i=1}^{M} log(Y(i)) · cos(π n (i − 0.5) / M)
where C[n] is the MFCC vector for each frame.
The resulting vector is called the Mel-frequency cepstrum (MFC), and the
individual components are the Mel-frequency cepstral coefficients (MFCCs). We
extracted 12 features from each speech frame.
2.3.1.9 | Post Processing
Cepstral Mean Subtraction (CMS)
A speech signal may be subjected to some channel noise when recorded, also
referred to as the channel effect. A problem arises if the channel effect when
recording training data for a given person is different from the channel effect in
later recordings when the person uses the system: a false distance between the
training data and the newly recorded data is introduced by the differing channel
effects. The channel effect is eliminated by subtracting the mean Mel-cepstrum
vector, computed over all frames of the utterance, from the Mel-cepstrum
coefficients of each frame.
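CMS itself is a one-line operation over the matrix of per-frame MFCC vectors (an illustrative sketch):

```python
import numpy as np

def cepstral_mean_subtraction(mfccs):
    """Subtract the per-coefficient mean over all frames of an utterance.

    mfccs: array of shape (n_frames, n_coeffs).
    """
    return mfccs - mfccs.mean(axis=0, keepdims=True)
```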
The energy feature
The energy in a frame is the sum over time of the power of the samples in the
frame; thus for a signal x in a window from time sample t1 to time sample t2, the
energy is:
Energy = Σ_{t=t1}^{t2} x[t]²
Delta feature
Another interesting fact about the speech signal is that it is not constant from frame
to frame. Co-articulation (the influence of a speech sound on an adjacent or nearby
speech sound) can provide a useful cue for phone identity. It can be preserved by
using delta features. Velocity (delta) and acceleration (delta-delta) coefficients are
usually obtained from the static window-based information. These delta and
delta-delta coefficients model the speed and acceleration of the variation of the
cepstral feature vectors across adjacent windows. A simple way to compute
deltas would be just to compute the difference between frames; thus the delta value
d(t) for a particular cepstral value c(t) at time t can be estimated as:
d(t) = (c(t + 1) − c(t − 1)) / 2
This differencing method is simple, but since it acts as a high-pass filtering
operation in the parameter domain, it tends to amplify noise. The solution to this is
linear regression, i.e. fitting a first-order polynomial; the least-squares solution is
easily shown to be of the following form:
d(t) = Σ_{m=1}^{M} m (c(t + m) − c(t − m)) / (2 Σ_{m=1}^{M} m²)
where M is the regression window size. We used M = 4.
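The regression-based delta (with M = 4, and edge frames padded by repetition, a common convention) can be sketched as:

```python
import numpy as np

def delta(features, M=4):
    """Least-squares (linear-regression) delta over a window of +/- M frames.

    features: array of shape (n_frames, n_coeffs).
    """
    padded = np.pad(features, ((M, M), (0, 0)), mode="edge")
    denom = 2 * sum(m * m for m in range(1, M + 1))
    return np.array([
        sum(m * (padded[t + M + m] - padded[t + M - m]) for m in range(1, M + 1))
        for t in range(len(features))
    ]) / denom
```

Applied to the MFCC matrix this gives the ΔMFCC features; applying it again gives the ΔΔMFCC features.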
Composition of the Feature Vector
We calculated 39 features for each frame:
12 MFCC features.
12 ΔMFCC features.
12 ΔΔMFCC features.
1 energy feature.
1 Δenergy feature.
1 ΔΔenergy feature.