Speech Recognition 
Created By : 
Kanjariya Hardik G. 
Roll No : 17
Introduction 
 Speech recognition technology has recently reached a
higher level of performance and robustness, allowing users
to communicate with it simply by talking.
 Speech recognition is the process of decoding an acoustic
speech signal, captured by a microphone or telephone, into a
set of words.
 From these words, the whole utterance is recognized
word by word.
Types of SR 
 There are two main types of speaker models: speaker independent 
and speaker dependent. 
 Speaker independent models recognize the speech patterns of a large 
group of people. 
 Speaker dependent models recognize speech patterns from only one 
person. Both models use mathematical and statistical formulas to yield 
the best word match for speech. A third variation of speaker models is 
now emerging, called speaker adaptive. 
 Speaker adaptive systems usually begin with a speaker independent 
model and adjust these models more closely to each individual during a 
brief training period.
How does it work? 
 Speech produces a sound pressure wave which forms an 
acoustic signal. 
 The microphone receives the acoustic signal and converts it
into an analogue signal. 
 To store the analogue signal, it must be converted to a 
digital signal. 
 A speech recognizer tries to transform a digitally 
encoded acoustic signal in a natural language 
into text in that language.
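As a minimal sketch of the digitization step, the snippet below reads an already digitized signal from a WAV file with Python's standard wave module. The file name "speech_lab.wav" is only an illustrative placeholder, and the code assumes 16-bit mono PCM audio.

import struct
import wave

# Read the digitally encoded acoustic signal from a WAV file
# (assumed 16-bit mono PCM; "speech_lab.wav" is a placeholder name).
with wave.open("speech_lab.wav", "rb") as wav:
    sample_rate = wav.getframerate()         # samples per second
    n_samples = wav.getnframes()             # total number of samples
    raw_bytes = wav.readframes(n_samples)    # the digital signal as raw bytes

# Decode the bytes into plain integer sample values.
samples = struct.unpack("<" + "h" * (len(raw_bytes) // 2), raw_bytes)
print(len(samples), "samples at", sample_rate, "Hz")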
Speech Waveform/Spectrogram 
[Figure: waveform and spectrogram of the utterance "speech lab"; time in
seconds on the x-axis, frequency in Hz on the spectrogram's y-axis] 
 The spectrogram is an alternative way to characterize speech. 
 The louder the sound, the greater the amplitude on the waveform's y-axis. 
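The spectrogram itself can be computed from the digitized samples. The sketch below continues from the WAV-reading example above; it assumes NumPy and SciPy are available and is an illustration, not part of the original slides.

import numpy as np
from scipy import signal

# Turn the integer samples into a float array and compute the spectrogram.
x = np.asarray(samples, dtype=np.float32)
freqs, times, intensity = signal.spectrogram(x, fs=sample_rate)

# freqs: frequency bins in Hz (y-axis), times: seconds (x-axis),
# intensity: how strong each frequency is at each moment in time.
print(intensity.shape)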
Speech Recognition Process Flow
The major components 
 Audio input 
 Grammar 
 Acoustic Model 
 Recognized text
Audio I/O 
 It is important to understand that this audio 
stream is rarely pristine. 
 It contains not only the speech data (what was 
said) but also background noise. 
 This noise can interfere with the recognition 
process, and the speech engine must handle (and 
possibly even adapt to) the environment within 
which the audio is spoken.
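As a hedged example of dealing with a noisy environment, the sketch below uses the third-party SpeechRecognition package (which can drive engines such as CMU Sphinx) to sample the background for a moment before listening, so the recognizer can adapt its energy threshold. It assumes the SpeechRecognition and PyAudio packages are installed.

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    # Listen to the background briefly and adapt the energy threshold
    # to the current environment before capturing speech.
    r.adjust_for_ambient_noise(source, duration=1)
    audio = r.listen(source)   # speech data plus whatever noise remains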
Acoustic Model + Grammar 
 Once the speech data is in the proper format, the engine 
searches for the best match. 
 It does this by taking into consideration the words and phrases 
it knows about (the active grammars), along with its 
knowledge of the environment in which it is operating. 
 The knowledge of the environment is provided in the form of 
an acoustic model. 
 Once it identifies the most likely match for what was said, it 
returns what it recognized as a text string.
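Continuing the sketch above, the engine can then be asked for its most likely match, which comes back as a plain text string. Here CMU Sphinx is used through the same SpeechRecognition package; the JSGF grammar file "commands.gram" is a hypothetical example, and the grammar keyword is only accepted by recent versions of the package.

try:
    # Search the active grammar and acoustic model for the best match.
    text = r.recognize_sphinx(audio, grammar="commands.gram")
    print("Recognized:", text)
except sr.UnknownValueError:
    print("Speech could not be matched against the active grammar")
except sr.RequestError as err:
    print("Sphinx engine is not available:", err)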
About SR Engine 
 SR requires a software application "engine" with logic 
built in to decipher and act on the spoken word. 
 Sound Card 
– Converts the analogue signal from the microphone into a 
digital signal. 
 Function of the SR Engine 
– The SR engine converts the digital signal into phonemes, 
and the phonemes into words.
 Different SR engines 
 CMU Sphinx 
 Microsoft SAPI 
 IBM ViaVoice
Decoding Process
Recognition Process Flow 
Summary 
Step 1: User Input 
The system captures the user's voice in the form of an analog 
acoustic signal. 
Step 2: Digitization 
The analog acoustic signal is digitized. 
Step 3: Phonetic Breakdown 
The digital signal is broken into phonemes.
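As a toy illustration of step 3, the sketch below slices the digitized samples from the earlier WAV example into short overlapping frames; the 25 ms window and 10 ms step are typical values chosen for illustration, not figures from the slides. Each frame is the unit that would later be classified into a phoneme.

def split_into_frames(samples, sample_rate, frame_ms=25, step_ms=10):
    """Slice the digital signal into short overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step_len = int(sample_rate * step_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step_len)]

frames = split_into_frames(samples, sample_rate)
print(len(frames), "frames to classify into phonemes")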
Recognition Process Flow 
Summary 
 Step 4: Statistical Modeling 
 Phonemes are mapped to a phonetic representation using a 
statistical model. 
 Step 5: Matching 
 Using the grammar, the phonetic representation, and the 
dictionary, the system returns an n-best list (i.e., candidate 
words, each with a confidence score). 
 Grammar: the set of words or phrases that constrains the 
range of input or output in the voice application. 
 Dictionary: the mapping table between phonetic 
representations and words (e.g., "thu" and "thee" both map to "the").
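The toy sketch below illustrates the two structures named in steps 4 and 5: a pronunciation dictionary that maps phonetic representations to words, and an n-best list that pairs candidate words with confidence scores. The phoneme symbols and the scoring rule are invented purely for illustration.

# A made-up pronunciation dictionary: phonetic representation -> word.
pronunciation_dict = {
    ("dh", "ah"): "the",   # "thu"
    ("dh", "iy"): "the",   # "thee"
    ("t", "uw"): "two",
}

def n_best(phonemes, limit=3):
    """Return candidate words with (toy) confidence scores."""
    candidates = []
    for phones, word in pronunciation_dict.items():
        overlap = len(set(phones) & set(phonemes))
        score = overlap / max(len(phones), len(phonemes))
        candidates.append((word, round(score, 2)))
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:limit]

print(n_best(("dh", "ah")))   # e.g. [('the', 1.0), ('the', 0.5), ('two', 0.0)]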
REPRESENTATION OF SOFTWARE 
Challenges and Difficulties 
of SR 
Speech recognition is still a very difficult problem. The main 
challenges are: 
 Speaker Variability 
Two speakers, or even the same speaker, will pronounce the 
same word differently. 
 Channel Variability 
The quality and position of the microphone and the background 
environment affect the output.
Current Software Options for PC 
 Dragon Systems – Naturally Speaking 
 Philips – FreeSpeech 
 IBM – ViaVoice 
 Lernout & Hauspie – Voice Xpress
Editor's Notes

  • #6 The waveform of the utterance "speech lab" shows time in seconds along the x-axis and the sound pressure level on the y-axis; the louder the sound, the greater the amplitude. The spectrogram is an alternative way to characterize speech: time is still on the x-axis, but the y-axis shows frequency (in Hertz), and intensity is shown by the degree of darkness in the image.
  • #15 In step 4, there is an internal structure called the dictionary.