Speech recognition-using-wavelet-transform

Published in: Technology
MAIN PROJECT ‘10 SPEECH RECOGNITION USING WAVELET TRANSFORM www.final-yearprojects.co.cc | www.troubleshoot4free.com/fyp/

1. INTRODUCTION

Automatic speech recognition (ASR) aims at converting spoken language to text. Scientists all over the globe have been working in the domain of speech recognition for many decades, and it remains an area of intensive research. Recent advances in soft computing techniques have given more importance to automatic speech recognition. Large variation in speech signals, together with factors such as native accent and varying pronunciation, makes the task very difficult. ASR is hence a complex task, and it requires more intelligence to achieve a good recognition result.

Speech recognition is currently used in many real-time applications, such as cellular telephones, computers, and security systems. However, these systems are far from perfect in correctly classifying human speech into words. Speech recognizers consist of a feature extraction stage and a classification stage. The parameters from the feature extraction stage are compared in some form to parameters extracted from signals stored in a database or template. The parameters could also be fed to a neural network.

Speech word recognition systems commonly carry out some kind of classification or recognition based on speech features, which are usually obtained via Fourier Transforms (FTs), Short Time Fourier Transforms (STFTs), or Linear Predictive Coding techniques. However, these methods have some disadvantages: they assume signal stationarity within a given time frame and may therefore lack the ability to analyze localized events correctly. The wavelet transform copes with some of these problems. Other factors influencing the selection of Wavelet Transforms (WT) over conventional methods include their ability to determine localized features. The Discrete Wavelet Transform method is used for speech processing in this project.
2. LITERATURE SURVEY

Designing a machine that mimics human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model for speech analysis and synthesis, the problem of automatic speech recognition has been approached progressively, from a simple machine that responds to a small set of sounds to a sophisticated system that responds to fluently spoken natural language and takes into account the varying statistics of the language in which the speech is produced. Based on major advances in statistical modeling of speech in the 1980s, automatic speech recognition systems today find widespread application in tasks that require a human-machine interface, such as automatic call processing in the telephone network and query-based information systems that do things like provide updated travel information, stock price quotations, weather reports, etc.

Speech is the primary means of communication between people. For reasons ranging from technological curiosity about the mechanisms for mechanical realization of human speech capabilities, to the desire to automate simple tasks inherently requiring human-machine interactions, research in automatic speech recognition (and speech synthesis) by machine has attracted a great deal of attention over the past five decades.

The desire for automation of simple tasks is not a modern phenomenon, but one that goes back more than one hundred years in history.
By way of example, in 1881 Alexander Graham Bell, his cousin Chichester Bell and Charles Sumner Tainter invented a recording device that used a rotating cylinder with a wax coating on which up-and-down grooves could be cut by a stylus, which responded to incoming sound pressure (in much the same way as a microphone that Bell invented earlier for use with the telephone). Based on this invention, Bell and Tainter formed the Volta Graphophone Co. in 1888 in order to manufacture machines for the recording and reproduction of sound in office environments. The American Graphophone Co., which later became the Columbia Graphophone Co., acquired the patent in 1907 and trademarked the term “Dictaphone.” Just about the same time, Thomas Edison invented the phonograph using a tinfoil based cylinder, which was subsequently adapted to wax, and developed the “Ediphone” to compete directly with Columbia. The purpose of these products was to record dictation of notes and letters for a secretary (likely in a large pool that offered the service) who would later type them out (offline), thereby circumventing the need for costly stenographers.

This turn-of-the-century concept of “office mechanization” spawned a range of electric and electronic implements and improvements, including the electric typewriter, which changed the face of office automation in the mid-part of the twentieth century. It does not take much imagination to envision the obvious interest in creating an “automatic typewriter” that could directly respond to and transcribe a human's voice without having to deal with the annoyance of recording and handling the speech on wax cylinders or other recording media.

A similar kind of automation took place a century later, in the 1990's, in the area of “call centers.” A call center is a concentration of agents or associates that handle telephone calls from customers requesting assistance. Among the tasks of such call centers are routing the in-coming calls to the proper department, where specific help is provided or where transactions are carried out. One example of such a service was the AT&T Operator line which helped a caller place calls, arrange payment methods, and conduct credit card transactions. The number of agent positions (or stations) in a large call center could reach several thousand.

From Speech Production Models to Spectral Representations

Attempts to develop machines to mimic a human's speech communication capability appear to have started in the 2nd half of the 18th century.
The early interest was not on recognizing and understanding speech but instead on creating a speaking machine, perhaps due to the readily available knowledge of acoustic resonance tubes which were used to approximate the human vocal tract. In 1773, the Russian scientist Christian Kratzenstein, a professor of physiology in Copenhagen, succeeded in producing vowel sounds using resonance tubes connected to organ pipes. Later, Wolfgang von Kempelen in Vienna constructed an “Acoustic-Mechanical Speech Machine” (1791), and in the mid-1800s Charles Wheatstone [6] built a version of von Kempelen's speaking machine using resonators made of leather, the configuration of which could be altered or controlled with a hand to produce different speech-like sounds.

During the first half of the 20th century, work by Fletcher [8] and others at Bell Laboratories documented the relationship between a given speech spectrum (which is the distribution of power of a speech sound across frequency), and its sound characteristics as well as its intelligibility, as perceived by a human listener. In the 1930's Homer Dudley, influenced greatly by Fletcher's research, developed a speech synthesizer called the VODER (Voice Operating Demonstrator), which was an electrical equivalent (with mechanical control) of Wheatstone's mechanical speaking machine. Dudley's VODER consisted of a wrist bar for selecting either a relaxation oscillator output or noise as the driving signal, and a foot pedal to control the oscillator frequency (the pitch of the synthesized voice). The driving signal was passed through ten band-pass filters whose output levels were controlled by the operator's fingers. These ten band-pass filters were used to alter the power distribution of the source signal across a frequency range, thereby determining the characteristics of the speech-like sound at the loudspeaker. Thus to synthesize a sentence, the VODER operator had to learn how to control and “play” the VODER so that the appropriate sounds of the sentence were produced. The VODER was demonstrated at the World's Fair in New York City in 1939 and was considered an important milestone in the evolution of speaking machines.

Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the importance of the signal spectrum for reliable identification of the phonetic nature of a speech sound.
Following the convention established by these two outstanding scientists, most modern systems and algorithms for speech recognition are based on the concept of measurement of the (time-varying) speech power spectrum (or its variants such as the cepstrum), in part due to the fact that measurement of the power spectrum from a signal is relatively easy to accomplish with modern digital signal processing techniques.

Early Automatic Speech Recognizers

Early attempts to design systems for automatic speech recognition were mostly guided by the theory of acoustic-phonetics, which describes the phonetic elements of speech (the basic sounds of the language) and tries to explain how they are acoustically realized in a spoken utterance. These elements include the phonemes and the corresponding place and manner of articulation used to produce the sound in various phonetic contexts. For example, in order to produce a steady vowel sound, the vocal cords need to vibrate (to excite the vocal tract), and the air that propagates through the vocal tract results in sound with natural modes of resonance similar to what occurs in an acoustic tube. These natural modes of resonance, called the formants or formant frequencies, are manifested as major regions of energy concentration in the speech power spectrum. In 1952, Davis, Biddulph, and Balashek of Bell Laboratories built a system for isolated digit recognition for a single speaker, using the formant frequencies measured (or estimated) during vowel regions of each digit. These trajectories served as the “reference pattern” for determining the identity of an unknown digit utterance as the best matching digit.

In other early recognition systems of the 1950's, Olson and Belar of RCA Laboratories built a system to recognize 10 syllables of a single talker, and at MIT Lincoln Lab, Forgie and Forgie built a speaker-independent 10-vowel recognizer. In the 1960's, several Japanese laboratories demonstrated their capability of building special purpose hardware to perform a speech recognition task. Most notable were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo, the phoneme recognizer of Sakai and Doshita at Kyoto University, and the digit recognizer of NEC Laboratories [14]. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance. In contrast, an isolated digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit “segmenter.” Kyoto University's work could be considered a precursor to a continuous speech recognition system.
In another early recognition system Fry and Denes, at University College in England, built a phoneme recognizer to recognize 4 vowels and 9 consonants. By incorporating statistical information about allowable phoneme sequences in English, they increased the overall phoneme recognition accuracy for words consisting of two or more phonemes. This work marked the first use of statistical syntax (at the phoneme level) in automatic speech recognition.

An alternative to the use of a speech segmenter was the concept of adopting a non-uniform time scale for aligning speech patterns. This concept started to gain acceptance in the 1960's through the work of Tom Martin at RCA Laboratories and Vintsyuk in the Soviet Union. Martin recognized the need to deal with the temporal non-uniformity in repeated speech events and suggested a range of solutions, including detection of utterance endpoints, which greatly enhanced the reliability of the recognizer performance. Vintsyuk proposed the use of dynamic programming for time alignment between two utterances in order to derive a meaningful assessment of their similarity. His work, though largely unknown in the West, appears to have preceded that of Sakoe and Chiba as well as others who proposed more formal methods, generally known as dynamic time warping, in speech pattern matching. Since the late 1970's, mainly due to the publication by Sakoe and Chiba, dynamic programming, in numerous variant forms (including the Viterbi algorithm [19] which came from the communication theory community), has become an indispensable technique in automatic speech recognition.

Advancement in technology

The figure shows a timeline of progress in speech recognition and understanding technology over the past several decades. We see that in the 1960's we were able to recognize small vocabularies (order of 10-100 words) of isolated words, based on simple acoustic-phonetic properties of speech sounds. The key technologies that were developed during this time frame were filter-bank analyses, simple time normalization methods, and the beginnings of sophisticated dynamic programming methodologies. In the 1970's we were able to recognize medium vocabularies (order of 100-1000 words) using simple template-based, pattern recognition methods. The key technologies that were developed during this period were the pattern recognition models, the introduction of LPC methods for spectral representation, the pattern clustering methods for speaker-independent recognizers, and the introduction of dynamic programming methods for solving connected word recognition problems. In the 1980's we started to tackle large vocabulary (1000 to an unlimited number of words) speech recognition problems based on statistical methods, with a wide range of networks for handling language structures.
The key technologies introduced during this period were the hidden Markov model (HMM) and the stochastic language model, which together enabled powerful new methods for handling virtually any continuous speech recognition problem efficiently and with high performance. In the 1990's we were able to build large vocabulary systems with unconstrained language models, and constrained task syntax models for continuous speech recognition and understanding. The key technologies developed during this period were the methods for stochastic language understanding, statistical learning of acoustic and language models, and the introduction of finite state transducer framework (and the FSM Library) and the methods for their determination and minimization for efficient implementation of large vocabulary speech understanding systems.
Finally, in the last few years, we have seen the introduction of very large vocabulary systems with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and multi-modal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog systems with a range of input and output modalities for ease-of-use and flexibility in handling adverse environments where speech might not be as suitable as other input-output modalities. During this period we have seen the emergence of highly natural concatenative speech synthesis systems, the use of machine learning to improve both speech understanding and speech dialogs, and the introduction of mixed-initiative dialog systems to enable user control when necessary.

After nearly five decades of research, speech recognition technologies have finally entered the marketplace, benefiting the users in a variety of ways. Throughout the course of development of such systems, knowledge of speech production and perception was used in establishing the technological foundation for the resulting speech recognizers. Major advances, however, were brought about in the 1960's and 1970's via the introduction of advanced speech representations based on LPC analysis and cepstral analysis methods, and in the 1980's through the introduction of rigorous statistical methods based on hidden Markov models. All of this came about because of significant research contributions from academia, private industry and the government. As the technology continues to mature, it is clear that many new applications will emerge and become part of our way of life, thereby taking full advantage of machines that are partially able to mimic human speech capabilities.
3. METHODOLOGY OF THE PROJECT

The methodology of the project involves the following steps:

1. Database collection
2. Decomposition of the speech signal
3. Feature vector extraction
4. Developing a classifier
5. Training the classifier
6. Testing the classifier

Each of these steps is discussed in detail below.

3.1 Database collection

Database collection is the most important step in speech recognition. Only an efficient database can yield a good speech recognition system. Different people say the same word differently, owing to differences in pitch, slang, and pronunciation. In this step the same word is recorded by different persons. All words are recorded at the same frequency, 16 kHz. Collecting too many samples does not necessarily benefit speech recognition and can sometimes affect it adversely, so the right number of samples should be taken. The same step is repeated for the other words.

3.2 Decomposition of speech signal

The next step is speech signal decomposition. For this we can use different techniques such as LPC, MFCC, STFT, and the wavelet transform. Over the past 10 years the wavelet transform has been widely used in speech recognition. Speech recognition systems generally carry out some kind of classification/recognition based upon speech features which are usually obtained via time-frequency representations such as Short Time Fourier Transforms (STFTs) or Linear Predictive Coding (LPC) techniques. In some respects, these methods may not be suitable for representing speech; they assume signal stationarity within a given time frame and may therefore lack the ability to analyze localized events accurately. Furthermore, the LPC approach assumes a particular linear (all-pole) model of speech production which strictly speaking is not the case.

Other approaches based on Cohen's general class of time-frequency distributions, such as the Cone-Kernel and Choi-Williams methods, have also found use in speech recognition applications but have the drawback of introducing unwanted cross-terms into the representation. The Wavelet Transform overcomes some of these limitations; it can provide a constant-Q analysis of a given signal by projection onto a set of basis functions that are scale variant with frequency. Each wavelet is a shifted and scaled version of an original or mother wavelet. These families are usually orthogonal to one another, which is important since it yields computational efficiency and ease of numerical implementation. Other factors influencing the choice of Wavelet Transforms over conventional methods include their ability to capture localized features.

[Figure: Tiling of the time-frequency plane via the wavelet transform]
Wavelet Transform

The wavelet transform provides a time-frequency representation of a signal. (There are other transforms which give this information too, such as the short time Fourier transform, Wigner distributions, etc.)

Often a particular spectral component occurring at a given instant is of particular interest. In these cases it may be very beneficial to know the time intervals in which these spectral components occur. For example, in EEGs, the latency of an event-related potential is of particular interest. (An event-related potential is the response of the brain to a specific stimulus such as a flash of light; the latency of this response is the amount of time elapsed between the onset of the stimulus and the response.) The wavelet transform is capable of providing the time and frequency information simultaneously.

The wavelet transform can be applied to non-stationary signals. It concentrates on small portions of the signal which can be considered stationary. It has a variable-size window, unlike the constant-size window of the STFT. The WT tells us which band of frequencies is present in a given interval of time.
There are two methodologies for speech decomposition using wavelets: the Discrete Wavelet Transform (DWT) and Wavelet Packet Decomposition (WPD). Of the two, the DWT is used in our project.

Discrete Wavelet Transform

The transform of a signal is just another form of representing the signal. It does not change the information content present in the signal. For many signals, the low-frequency part contains the most important information. It gives an identity to a signal. Consider the human voice. If we remove the high-frequency components, the voice sounds different, but we can still tell what's being said. In wavelet analysis, we often speak of approximations and details. The approximations are the high-scale, low-frequency components of the signal. The details are the low-scale, high-frequency components. The DWT is defined by an equation (not reproduced in this transcript) in which ψ(t), a time function with finite energy and fast decay, is called the mother wavelet.

The DWT analysis can be performed using a fast, pyramidal algorithm related to multi-rate filter-banks. As a multi-rate filter-bank the DWT can be viewed as a constant-Q filter-bank with octave spacing between the centers of the filters. Each sub-band contains half the samples of the neighboring higher frequency sub-band. In the pyramidal algorithm the signal is analyzed at different frequency bands with different resolution by decomposing the signal into a coarse approximation and detail information. The coarse approximation is then further decomposed using the same wavelet decomposition step. This is achieved by successive high-pass and low-pass filtering of the time domain signal and is defined by the following equations:
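The equations referred to in this section did not survive in the transcript. As a hedged reconstruction, and not the authors' own notation, the standard textbook forms that match the description are given here. The symbols W(j,k), x[n], g, h, y_high and y_low are assumptions introduced for illustration: x[n] is the sampled speech signal, g is the high-pass filter and h is the low-pass filter.

```latex
% DWT of x[n] with mother wavelet \psi, at scale j and shift k
W(j,k) = 2^{-j/2} \sum_{n} x[n] \, \psi\!\left(2^{-j} n - k\right)

% One level of the pyramidal algorithm: filter, then downsample by 2
y_{\mathrm{high}}[k] = \sum_{n} x[n] \, g[2k - n]
\qquad
y_{\mathrm{low}}[k]  = \sum_{n} x[n] \, h[2k - n]
```

The factor 2k in the filter index is what implements the downsampling by 2: only every second filter output is kept, which is why each sub-band holds half the samples of the next higher one.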
[Figure 1: Signal x[n] is passed through low-pass and high-pass filters and downsampled by 2]

In the DWT, each level is calculated by passing the previous approximation coefficients through a high-pass and a low-pass filter. In the WPD, however, both the detail and the approximation coefficients are decomposed.

[Figure 2: Decomposition tree]

The DWT is computed by successive low-pass and high-pass filtering of the discrete time-domain signal, as shown in Figures 1 and 2. This is called the Mallat algorithm or Mallat-tree decomposition.

The mother wavelet used is the Daubechies 4 wavelet. It uses a larger number of filter coefficients. Daubechies wavelets are the most popular wavelets. They represent the foundations of wavelet signal processing and are used in numerous applications. They are also called Maxflat wavelets, as their frequency responses have maximum flatness at frequencies 0 and π.
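The Mallat-tree decomposition described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `dwt_step` and `wavedec` are hypothetical, the filter is the 4-tap Daubechies filter (the report does not say which naming convention "Daubechies 4" follows), and the signal edges are handled by periodic extension.

```python
import math

# 4-tap Daubechies low-pass (scaling) filter. Assumption: the report does not
# list its filter coefficients, so this is only a stand-in for its wavelet.
_S3 = math.sqrt(3.0)
_NORM = 4.0 * math.sqrt(2.0)
LOW = [(1 + _S3) / _NORM, (3 + _S3) / _NORM, (3 - _S3) / _NORM, (1 - _S3) / _NORM]
# Matching high-pass (wavelet) filter from the quadrature-mirror relation.
HIGH = [(-1) ** k * LOW[len(LOW) - 1 - k] for k in range(len(LOW))]


def dwt_step(signal):
    """One analysis level: filter, then keep every second output sample."""
    n = len(signal)
    if n == 0 or n % 2:
        raise ValueError("signal length must be a positive even number")
    approx, detail = [], []
    for k in range(n // 2):
        a = d = 0.0
        for i in range(len(LOW)):
            s = signal[(2 * k + i) % n]  # periodic extension at the edges
            a += LOW[i] * s
            d += HIGH[i] * s
        approx.append(a)  # low-pass branch: coarse approximation
        detail.append(d)  # high-pass branch: detail information
    return approx, detail


def wavedec(signal, levels):
    """Return [cA_levels, cD_levels, ..., cD_1] by repeatedly splitting cA."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = dwt_step(approx)
        details.append(detail)
    return [approx] + details[::-1]
```

Calling `wavedec(x, 8)` on a 256-sample list returns nine sub-bands: one approximation and eight detail bands, each detail band half the length of the next finer one. Only the approximation is split again at each level, which is what distinguishes the DWT from the WPD.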
[Figure: Daubechies wavelet of order 4]

3.3 Feature vectors extraction

Feature extraction is the key to ASR. It is arguably the most important component of designing an intelligent system based on speech/speaker recognition, since even the best classifier will perform poorly if the features are not chosen well. A feature extractor should reduce the pattern vector (i.e., the original waveform) to a lower dimension which contains most of the useful information from the original vector.

The extracted wavelet coefficients provide a compact representation that shows the energy distribution of the signal in time and frequency. In order to further reduce the dimensionality of the extracted feature vectors, statistics over the set of the wavelet coefficients are used. That way the statistical characteristics of the “texture” or the “music surface” of the piece can be represented. For example, the distribution of energy in time and frequency for music is different from that of speech.

The following features are used in our system:

  • The mean of the absolute value of the coefficients in each sub-band. These features provide information about the frequency distribution of the audio signal.
  • The standard deviation of the coefficients in each sub-band. These features provide information about the amount of change of the frequency distribution.
  • Energy of each sub-band of the signal. These features provide information about the energy of each sub-band.
  • Kurtosis of each sub-band of the signal. These features measure whether the data are peaked or flat relative to a normal distribution.
  • Skewness of each sub-band of the signal. These features measure symmetry, or the lack of it.

These features are then combined into a hybrid feature and fed to a classifier. Features are combined using a matrix in which all the features of one sample form one column.

3.4 Developing a classifier

Generally, three methods are commonly used in speech recognition: Dynamic Time Warping (DTW), Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs).

Dynamic time warping (DTW) is a technique that finds the optimal alignment between two time series when one time series may be warped non-linearly by stretching or shrinking it along its time axis. This warping can then be used to find corresponding regions between the two time series or to determine the similarity between them. In speech recognition, dynamic time warping is often used to determine whether two waveforms represent the same spoken phrase. The method is used to time-align two words and estimate their difference. In a speech waveform, the duration of each spoken sound and the interval between sounds are permitted to vary, but the overall speech waveforms must be similar. The main problems of such systems are the small number of words that can be learned, the high computational cost, and the large memory requirement.

Hidden Markov Models are finite automata with a given number of states; passing from one state to another is made instantaneously at equally spaced time moments. At every pass from one state to another, the system generates observations. Two processes take place: the transparent one, represented by the observation string (feature sequence), and the hidden one, which cannot be observed, represented by the state string. The main concerns of this method are the timing sequence and the comparison method.
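Of the three classifiers surveyed above, dynamic time warping is the easiest to show in code. The sketch below is an illustration of the textbook dynamic-programming recurrence, not something taken from the project, which uses a neural network instead. The function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best non-linear alignment of sequences a and b."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance measure
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b holds (b is stretched)
                cost[i][j - 1],      # b advances while a holds (a is stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

The table has (n + 1) × (m + 1) entries, which illustrates the memory and computation cost mentioned above: both grow with the product of the two sequence lengths, and a new utterance must be compared against every stored template.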
Nowadays, ANNs are widely used for their parallel distributed processing, distributed memory, error stability, and ability to learn and distinguish patterns. The complexity of all these systems increases as their generality rises. The biggest restriction of the first two methods is their low speed in searching and comparing models. ANNs are faster, because the output results from multiplying the present input by the adjusted weights. At present the TDNN (Time-Delay Neural Network) is widely used in speech recognition.

Neural Networks

A neural network (NN) is a massive processing system that consists of many processing entities connected through links that represent the relationship between them. A Multilayer Perceptron (MLP) network consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of multiple neurons. An artificial neuron is the smallest unit that constitutes the artificial neural network. The actual computation and processing of the neural network happens inside the neuron. In this work, we use an MLP architecture: the feed-forward network with the back-propagation training algorithm (FFBP). In this type of network, the input is presented to the network and moves through the weights and nonlinear activation functions toward the output layer, and the error is corrected in a backward direction using the well-known error back-propagation correction algorithm. The FFBP is best suited for structural pattern recognition. In structural pattern recognition tasks, there are N training examples, where each training example consists of a pattern and a target class (x,y). These examples are assumed to be generated independently according to the joint distribution P(x,y). A structural classifier is then defined as a function h that performs the static mapping from patterns to target classes, y=h(x). The function h is usually produced by searching through a space of candidate classifiers and returning the function h that performs well on the training examples during a learning process.
A neural network returns the function h in the form of a matrix of weights.
[Figure: An artificial neuron]

The number of neurons in each hidden layer has a direct impact on the performance of the network during training as well as during operation. Having more neurons than a problem needs leads to overfitting, a situation in which the network memorizes the training examples. Networks that overfit perform well on training examples and poorly on unseen examples. Having fewer neurons than a problem needs leads to underfitting, which happens when the network architecture cannot cope with the complexity of the problem at hand. Underfitting results in inadequate modeling and therefore poor performance of the network.

[Figure: MLP neural network architecture]
The Backpropagation Algorithm

The backpropagation algorithm (Rumelhart and McClelland, 1986) is used in layered feed-forward ANNs. This means that the artificial neurons are organized in layers and send their signals “forward”, and then the errors are propagated backwards. The network receives inputs through neurons in the input layer, and the output of the network is given by the neurons in an output layer. There may be one or more intermediate hidden layers. The backpropagation algorithm uses supervised learning, which means that we provide the algorithm with examples of the inputs and the outputs we want the network to compute, and then the error (the difference between actual and expected results) is calculated. The idea of the backpropagation algorithm is to reduce this error until the ANN learns the training data. The training begins with random weights, and the goal is to adjust them so that the error will be minimal.

3.5 Training the classifier

After development, the classifier goes through two steps: training and testing. In the training phase the features of the samples are fed as input to the ANN and the target is set. Then the network is trained. The network adjusts its weights such that the target is achieved for the given input. In this project we have used the functions 'tansig' and 'logsig'. The logsig function bounds its output between 0 and 1, so the targets are chosen within that range. The target is given as 0.9 0.1 0.1 ... 0.1 for the 1st word, 0.1 0.9 0.1 ... 0.1 for the 2nd word, and so on. The position of the maximum value identifies the recognized word.

3.6 Testing the classifier

The next phase is testing. The samples which were set aside for testing are given to the classifier and the output is noted. If we do not get the desired output, we reach the required output by adjusting the number of neurons.
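The training and testing loop described in Sections 3.5 and 3.6 can be illustrated with a very small feed-forward network trained by backpropagation. This is a sketch under stated assumptions, not the project's network: the class `TinyMLP`, the helper `predict`, the layer sizes, the learning rate, the squared-error loss, and the placement of tansig on the hidden layer and logsig on the output layer are all hypothetical choices. The definitions of `tansig` (hyperbolic tangent) and `logsig` (logistic sigmoid) follow the usual meaning of those names.

```python
import math
import random


def tansig(x):
    """Hyperbolic tangent sigmoid, output in (-1, 1)."""
    return math.tanh(x)


def logsig(x):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer (tansig) and one output layer (logsig)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)  # training begins with random weights
        # Each row holds the weights of one neuron; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        h = [tansig(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]
        y = [logsig(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return h, y

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update; returns the squared error before it."""
        h, y = self.forward(x)
        # Error terms at the output layer (derivative of logsig is y * (1 - y)).
        d_out = [(y[k] - target[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
        # Propagate the error backwards (derivative of tansig is 1 - h^2).
        d_hid = [(1.0 - h[j] ** 2) * sum(d_out[k] * self.w2[k][j]
                                         for k in range(len(y)))
                 for j in range(len(h))]
        for k, row in enumerate(self.w2):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * d_out[k] * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(list(x) + [1.0]):
                row[i] -= lr * d_hid[j] * v
        return 0.5 * sum((y[k] - target[k]) ** 2 for k in range(len(y)))


def predict(net, x):
    """Index of the largest output, i.e. the recognized word."""
    _, y = net.forward(x)
    return max(range(len(y)), key=lambda k: y[k])
```

In use, each training pair would be a feature vector and a target such as `[0.9, 0.1, 0.1, 0.1, 0.1]`, with `train_step` called repeatedly over the training set and `predict` applied to the held-out samples. Changing `n_hidden` corresponds to the adjustment of the number of neurons mentioned in Section 3.6.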
  19. 4. OBSERVATION
      We recorded five Malayalam words: "onnu", "randu", "naalu", "anju" and "aaru". These are the Malayalam words for the numerals 1, 2, 4, 5 and 6 respectively. These words were selected because the project was intended to implement a password system with numerals.
      Malayalam word (transliterated) and numeral: onnu = 1, randu = 2, naalu = 4, anju = 5, aaru = 6.
  20. Twenty samples of each word were recorded from different people, and each sample was normalized by dividing it by its maximum value. The samples were then decomposed using the wavelet transform up to eight levels, since the majority of the information about the signal is present in the low-frequency region.
      To classify the signals, an ANN was developed and trained with the outputs fixed as follows:
      If the word is 'onnu', the output will be 0.9 0.1 0.1 0.1 0.1
      If the word is 'randu', the output will be 0.1 0.9 0.1 0.1 0.1
      If the word is 'naalu', the output will be 0.1 0.1 0.9 0.1 0.1
      If the word is 'anju', the output will be 0.1 0.1 0.1 0.9 0.1
      If the word is 'aaru', the output will be 0.1 0.1 0.1 0.1 0.9
      Out of the 20 samples recorded for each word, 16 were used to train the ANN and the remaining 4 were used for testing.
      Plots
      Figure: Plot for word 'onnu'.
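The preprocessing and labelling steps on this slide can be sketched as follows. This is a minimal sketch under stated assumptions: normalization is taken to mean division by the maximum absolute value, the 16/4 split is taken in recording order, and the helper names (`normalize`, `target_vector`, `decode`, `split`) are hypothetical.

```python
WORDS = ["onnu", "randu", "naalu", "anju", "aaru"]

def normalize(signal):
    """Divide a signal by its peak (assumed: maximum absolute value) so it lies in [-1, 1]."""
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal] if peak > 0 else list(signal)

def target_vector(word):
    """Target output for a word: 0.9 at the word's position, 0.1 elsewhere."""
    return [0.9 if w == word else 0.1 for w in WORDS]

def decode(output):
    """Return the word whose position holds the maximum output value."""
    best = max(range(len(output)), key=lambda i: output[i])
    return WORDS[best]

def split(samples, n_train=16):
    """Split one word's samples into a training part and a test part."""
    return samples[:n_train], samples[n_train:]

print(normalize([2, -4, 1]))               # [0.5, -1.0, 0.25]
print(target_vector("randu"))              # [0.1, 0.9, 0.1, 0.1, 0.1]
print(decode([0.2, 0.1, 0.8, 0.3, 0.1]))   # naalu
train, test = split(list(range(20)))
print(len(train), len(test))               # 16 4
```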
  21. Figure: Plot for word 'randu'. Figure: Plot for word 'naalu'.
  22. Figure: Plot for word 'anju'. Figure: Plot for word 'aaru'.
  23. DWT Tree
      The figure shows the 8-level decomposition tree for a signal using the DWT, which produces one set of approximation coefficients and eight sets of detail coefficients.
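The structure of the decomposition tree can be illustrated with a short sketch. Note that it uses the Haar wavelet as a simple stand-in, not the Daubechies wavelet used in the project, and the test signal and function names (`haar_step`, `wavedec`) are assumptions made for the example. The point is only to show how 8 levels yield one approximation band and eight detail bands.

```python
import math

def haar_step(signal):
    """One DWT level with the Haar wavelet: low-pass (approximation) and high-pass (detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def wavedec(signal, levels):
    """Multi-level decomposition; returns [cA_n, cD_n, ..., cD_1]."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        # Only the approximation branch is split again at the next level.
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]

# Hypothetical test signal: 1024 samples of a low-frequency sine wave.
signal = [math.sin(2 * math.pi * 5 * n / 1024) for n in range(1024)]
bands = wavedec(signal, 8)

print(len(bands))               # 9 bands: 1 approximation + 8 details
print([len(b) for b in bands])  # [4, 4, 8, 16, 32, 64, 128, 256, 512]
```

Each level halves the length of the approximation, so the deeper bands cover progressively narrower low-frequency ranges, which is where the text says most of the signal information lies.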
  24. Figure: Decomposed waveforms for word 'randu'.
  25. Figure: Decomposed waveforms for word 'aaru'.
  26. 5. TESTING AND RESULT
      Testing with pre-recorded samples: Out of the 20 samples recorded for each word, 16 were used for training. We tested the program's accuracy with the remaining 4 unused samples of each word. A total of 20 samples were tested (4 samples for each of the 5 words), and the program gave the right result for all 20. Thus, we obtained 100% accuracy with pre-recorded samples.
      Real-time testing: For real-time testing, we recorded a sample with a microphone and ran the program directly on it. A total of 30 samples were tested, of which 20 gave the right result. This is an accuracy of about 67% with real-time samples.
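As a quick check of the two figures above, accuracy is simply the number of correct results divided by the number of samples tested. The function name `accuracy` is an assumption made for this sketch.

```python
def accuracy(correct, total):
    """Recognition accuracy as a percentage."""
    return 100.0 * correct / total

print(accuracy(20, 20))            # pre-recorded samples: 100.0
print(round(accuracy(20, 30), 1))  # real-time samples: 66.7
```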
  27. The change in efficiency when the parameters of the ANN are changed was observed and is plotted below.
      Plot 1: Accuracy with a 2-layer feed-forward network, number of neurons in the first layer = 15.
  28. Plot 2: Accuracy with a 2-layer feed-forward network, number of neurons in the first layer = 20.
  29. Plot 3: Accuracy with a 3-layer feed-forward network, number of neurons in the first layer N1 = 15 and number of neurons in the second layer N2 = 5.
  30. 7. CONCLUSION
      Speech recognition is an advanced area of research, and much work has been done in this domain to implement new and enhanced approaches. During the experiment we observed the effectiveness of the Daubechies4 mother wavelet in feature extraction. We used only a limited number of samples in this experiment. Increasing the number of samples may give better features and a better recognition result for Malayalam word utterances. The performance of the neural network with wavelet features is appreciable. The software we used has some limitations; increasing the number of samples as well as the number of training iterations could produce a better recognition result.
      We also observed that a neural network is an effective tool that can be combined successfully with wavelets. Wavelet-based feature extraction could also be combined with other classification methods, such as neuro-fuzzy and genetic algorithm techniques, to perform the same task.
      From this study we could understand and experience the effectiveness of the discrete wavelet transform in feature extraction. Our recognition results under different kinds of noise and noisy conditions show that choosing dyadic bandwidths gives better performance than choosing equal bandwidths in sub-band recombination. This result agrees with the way the human ear recognizes speech and shows a useful benefit of the dyadic nature of the multi-level wavelet transform for sub-band speech recognition.
      The wavelet transform is a more powerful technique for speech processing than earlier techniques. ANN has proved to be a more successful classifier than HMM.
  31. 8. REFERENCES
      [1] Vimal Krishnan V.R., Athulya Jayakumar, Babu Anto P., "Speech Recognition of Isolated Malayalam Words Using Wavelet Features and Artificial Neural Network", 4th IEEE International Symposium on Electronic Design, Test & Applications.
      [2] Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech Recognition", Englewood Cliffs, NJ: Prentice Hall, 1993.
      [3] Stephane Mallat, "A Wavelet Tour of Signal Processing", San Diego: Academic Press, 1999, ISBN 012466606.
      [4] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674-693, 1989.
      [5] K.P. Soman, K.I. Ramachandran, "Insight into Wavelets: From Theory to Practice", Second Edition, PHI, 2005.
      [6] S. Kadambe, P. Srinivasan, "Application of Adaptive Wavelets for Speech", Optical Engineering 33(7), pp. 2204-2211, July 1994.
      [7] Stuart Russell, Peter Norvig, "Artificial Intelligence: A Modern Approach", New Delhi: Prentice Hall of India, 2005.
      [8] S.N. Sivanandam, S. Sumathi, S.N. Deepa, "Introduction to Neural Networks Using Matlab 6.0", New Delhi: Tata McGraw Hill, 2006.
      [9] James A. Freeman, David M. Skapura, "Neural Networks: Algorithms, Applications, and Programming Techniques", Pearson Education, 2006.
