2. Introduction
Speech recognition technology has recently reached a
higher level of performance and robustness, allowing
a user to communicate with a system by talking.
Speech recognition is the process of decoding an acoustic
speech signal, captured by a microphone or telephone, into a
set of words.
With the help of these words, the whole utterance is
recognized word by word.
3. Types of SR
There are two main types of speaker models: speaker independent
and speaker dependent.
Speaker independent models recognize the speech patterns of a large
group of people.
Speaker dependent models recognize speech patterns from only one
person. Both models use mathematical and statistical formulas to yield
the best word match for speech. A third variation of speaker models is
now emerging, called speaker adaptive.
Speaker adaptive systems usually begin with a speaker independent
model and adjust it more closely to each individual during a
brief training period.
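As a rough illustration of the speaker adaptive idea, the sketch below shifts a speaker independent mean feature vector toward the average of one speaker's training frames. The function name adapt_mean, the relevance weight, and the two-dimensional features are assumptions made for this example; the slides do not say which adaptation method a real system uses.

```python
def adapt_mean(prior_mean, speaker_frames, relevance=10.0):
    """Blend a speaker-independent mean with one speaker's training frames.

    `relevance` (an assumed tuning value) controls how strongly the
    speaker-independent mean resists being moved by new data.
    """
    n = len(speaker_frames)
    # Per-dimension sum of the speaker's feature frames.
    sums = [sum(frame[d] for frame in speaker_frames)
            for d in range(len(prior_mean))]
    return [(relevance * m + s) / (relevance + n)
            for m, s in zip(prior_mean, sums)]


# Speaker-independent starting point and a brief "training period".
prior = [0.0, 0.0]
frames = [[2.0, 4.0]] * 10
adapted = adapt_mean(prior, frames)
print(adapted)
```

With few training frames the result stays close to the speaker independent mean; with more frames it moves toward the individual speaker.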
4. How does it work?
Speech produces a sound pressure wave which forms an
acoustic signal.
The microphone receives the acoustic signal and converts it to an
analogue signal.
To store the analogue signal, it must be converted to a
digital signal.
A speech recognizer tries to transform a digitally
encoded acoustic signal of a natural-language utterance
into text in that language.
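The analogue-to-digital step can be sketched as sampling followed by quantization. This is a minimal illustration only: the 8000 Hz sample rate, the 16-bit depth, and the 440 Hz test tone standing in for a microphone signal are assumptions, not values given in the slides.

```python
import math


def digitize(analogue, duration_s, sample_rate_hz=8000, bits=16):
    """Sample a continuous signal and quantize each sample to a signed integer."""
    max_level = 2 ** (bits - 1) - 1          # 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz               # time of the n-th sample
        value = max(-1.0, min(1.0, analogue(t)))  # clip to the valid range
        samples.append(round(value * max_level))  # quantize
    return samples


def tone(t):
    """Stand-in for the analogue microphone signal: a 440 Hz sine wave."""
    return 0.5 * math.sin(2 * math.pi * 440.0 * t)


pcm = digitize(tone, duration_s=0.01)
print(len(pcm), pcm[:5])
```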
5. Speech Waveform/Spectrogram
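[Figure: waveform and spectrogram of an utterance, labeled with the segments s, p, ee, ch, l, a, b; frequency axis in Hz]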
In the waveform, the louder the sound, the greater the amplitude on the y-axis.
The spectrogram is an alternative way to characterize speech.
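To make the spectrogram idea concrete, the sketch below splits a signal into overlapping frames and computes the magnitude of each frequency bin per frame, so that time runs along one axis and frequency along the other. The frame length, hop size, Hann window, and the 1000 Hz test tone are assumptions chosen for illustration; a real system would normally use an FFT library rather than this slow direct transform.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return one magnitude spectrum per frame (rows = time, columns = frequency)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Hann window to reduce leakage at the frame edges.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1)))
                    for i, x in enumerate(frame)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct discrete Fourier transform for frequency bin k.
            acc = sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                      for i, x in enumerate(windowed))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


sample_rate = 8000
frame_len = 64
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(signal, frame_len=frame_len)

# Strongest frequency bin in the first frame, converted to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(peak_hz)
```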
8. Audio I/O
It is important to understand that this audio
stream is rarely pristine.
It contains not only the speech data (what was
said) but also background noise.
This noise can interfere with the recognition
process, and the speech engine must handle (and
possibly even adapt to) the environment in
which the speech is captured.
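One simple way to quantify how noisy an audio stream is would be a rough signal-to-noise estimate. The sketch below assumes that the first part of the recording contains only background noise, which is an assumption made for this example; the slides do not describe how a particular engine measures or adapts to noise.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_snr_db(samples, noise_only_len):
    """Rough SNR in decibels.

    Assumes the first `noise_only_len` samples hold background noise only
    and that this noise is not completely silent.
    """
    noise_rms = rms(samples[:noise_only_len])
    speech_rms = rms(samples[noise_only_len:])
    return 20 * math.log10(speech_rms / noise_rms)


# Toy data: quiet background followed by a louder "spoken" segment.
noise = [0.01, -0.01] * 50
speech = [0.1, -0.1] * 50
snr = estimate_snr_db(noise + speech, noise_only_len=100)
print(round(snr, 1))
```

A low value suggests that the background noise is likely to interfere with recognition.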
9. Acoustic Model + Grammar
Once the speech data is in the proper format, the engine
searches for the best match.
It does this by taking into consideration the words and phrases
it knows about (the active grammars), along with its
knowledge of the environment in which it is operating.
The knowledge of the environment is provided in the form of
an acoustic model.
Once it identifies the most likely match for what was said, it
returns what it recognized as a text string.
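The search for the best match can be pictured with a toy example in which only words in the active grammar are considered. The ACOUSTIC_MODEL table below (one reference feature vector per word) is a hypothetical stand-in for a real acoustic model, and the recognize function is an illustrative name, not the API of any actual engine.

```python
import math

# Hypothetical stand-in for an acoustic model: one reference vector per word.
ACOUSTIC_MODEL = {
    "yes": [1.0, 0.0],
    "no": [0.0, 1.0],
    "maybe": [0.7, 0.7],
}


def recognize(features, active_grammar):
    """Return, as a text string, the grammar word closest to the observed features."""
    candidates = [w for w in active_grammar if w in ACOUSTIC_MODEL]
    return min(candidates,
               key=lambda w: math.dist(features, ACOUSTIC_MODEL[w]))


observed = [0.8, 0.6]
print(recognize(observed, {"yes", "no", "maybe"}))  # all three words active
print(recognize(observed, {"yes", "no"}))           # "maybe" not in the grammar
```

Restricting the active grammar changes the result for the same audio, which is why the grammar and the acoustic knowledge are used together.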
10. About SR Engine
SR requires a software application "engine" with logic
built in to decipher and act on the spoken word.
Sound card
– Converts the analogue signal to a digital signal.
Function of the SR engine
– Converts the digital signal to phonemes, and the
phonemes to words.
11. Different SR Engines
CMU Sphinx
Microsoft SAPI
IBM ViaVoice
13. Recognition Process Flow
Summary
Step 1: User Input
The system captures the user's voice in the form of an analog
acoustic signal.
Step 2: Digitization
The analog acoustic signal is digitized.
Step 3: Phonetic Breakdown
The signal is broken into phonemes.
14. Recognition Process Flow
Summary
Step 4: Statistical Modeling
Phonemes are mapped to their phonetic representation
using a statistical model.
Step 5: Matching
Based on the grammar, the phonetic representation, and the
dictionary, the system returns an n-best list (i.e., a list of
words, each with a confidence score).
Grammar: the set of words or phrases that constrains the
range of input or output in the voice application.
Dictionary: the mapping table between phonetic
representations and words (e.g., thu, thee -> the).
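The matching step can be sketched as a lookup against a small pronunciation dictionary, ranked into an n-best list. Everything below is a toy under stated assumptions: the DICTIONARY entries, the phoneme spellings, and the confidence formula based on edit distance are invented for illustration and do not reflect how a specific engine scores its hypotheses.

```python
# Hypothetical pronunciation dictionary: phonetic representation -> word.
DICTIONARY = {
    ("th", "uh"): "the",
    ("th", "ee"): "the",
    ("th", "eh", "n"): "then",
    ("t", "ee"): "tea",
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def n_best(phonemes, grammar, n=3):
    """Return up to n (word, confidence) pairs, best first, limited to the grammar."""
    best = {}
    for pron, word in DICTIONARY.items():
        if word not in grammar:
            continue
        dist = edit_distance(phonemes, pron)
        confidence = 1.0 - dist / max(len(phonemes), len(pron))
        # Keep the best-scoring pronunciation for each word.
        best[word] = max(best.get(word, 0.0), confidence)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


results = n_best(("th", "ee"), grammar={"the", "tea", "then"})
for word, confidence in results:
    print(word, round(confidence, 2))
```

Both "th uh" and "th ee" map to the same word here, which mirrors the role of the dictionary as a table from several phonetic representations to one word.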
16. Challenges and Difficulties of SR
Speech recognition is still a very difficult problem. The main
problems are the following.
Speaker Variability
Two speakers, or even the same speaker, will pronounce the
same word differently.
Channel Variability
The quality and position of the microphone and the background
environment will affect the output.
17. Current Software Options for PC
Dragon Systems – NaturallySpeaking
Philips – FreeSpeech
IBM – ViaVoice
Lernout & Hauspie – Voice Xpress
Editor's Notes
The waveform of the utterance “speech lab” shows time in seconds along the x-axis and the pressure level on the y-axis; the louder the sound, the greater the amplitude on the y-axis.
The spectrogram is an alternative way to characterize speech. Time is still on the x-axis, but the y-axis shows frequency (in hertz), and intensity is shown by the degree of darkness in the image.
In step 4, there is an internal structure called the dictionary.