1. An Introduction to Speech Recognition
Advance Electronic Devices
EC - 410
Instructor: Dr. M Ravibabu
By: Mayank Awasthi (2006033)
2. Topics to be covered
Overview
Speech Production
SR system
Why Speech Recognition is difficult
Current Software Options for PC
Applications
References
3. Overview
Speech is the vocalized form of human communication.
Each spoken word is created out of the phonetic combination of
a limited set of vowel and consonant speech sound units.
Speech recognition is the ability of a machine or program
to identify words and phrases in spoken language and
convert them to a machine-readable format.
Speech recognition has evolved quite a bit over the past few
years. Initially, it used to work in discrete dictation mode, where
you had to pause between each spoken word. Today, however,
it uses continuous dictation. It’s also become smarter, with its
own set of grammar rules to make out the meaning of what’s
being said.
4. Speech Production
Normal human speech is produced with pulmonary pressure
provided by the lungs, which creates phonation in the glottis in
the larynx; the sound is then modified by the vocal
tract into different vowels and consonants.
Knowledge of how the various speech sounds are generated helps us to
understand the properties of speech sounds.
In short, we can say that sound is generated when the vocal tract is
excited.
The mode of excitation can be of three types:
1) Periodic: in the case of vowels
2) Aperiodic: in the case of consonants
3) Mixed
6. Contd.
In the case of voiced sounds such as vowels, the excitation is periodic.
The periodic opening and closing of the glottis results in puffs of air
exciting the vocal tract.
If we take 340 m/s as the speed of sound in air and
17 cm as the length of the vocal tract from glottis to lips, and treat the
tract as a tube closed at the glottis and open at the lips (so that its lowest
resonance has wavelength $\lambda = 4L$), the fundamental frequency of resonance is
$f_1 = c / \lambda = c / (4L) = 34000 / (4 \times 17) = 500$ Hz
The higher resonances fall at odd multiples of this value: 1500 Hz, 2500 Hz, etc.
Thus we should expect peaks in the frequency spectrum of the
vowel at these frequencies.
These peaks in the spectrum, due to resonance in the
vocal tract, are called formants.
Different speech sounds are generated by changing the
resonant cavity, resulting in different values of the frequency,
amplitude and bandwidth of the formants.
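The resonance arithmetic above can be checked with a short Python sketch. It assumes the idealized uniform tube, closed at one end and open at the other, that the calculation implies; the function name and the use of centimetre units are choices made here for illustration only.

```python
def tube_resonances(length_cm=17, speed_cm_per_s=34000, count=3):
    """Resonant frequencies (Hz) of an idealized tube closed at one end.

    Resonances fall at odd multiples of c / (4L). A real vocal tract is
    not a uniform tube, so these are only rough formant estimates.
    """
    return [(2 * n - 1) * speed_cm_per_s / (4 * length_cm)
            for n in range(1, count + 1)]


resonances = tube_resonances()
print(resonances)  # [500.0, 1500.0, 2500.0]
```

Passing a shorter `length_cm` raises every resonance, which is the same effect as changing the resonant cavity.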
7. Contd.
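Source excitation → Time-varying filter (representing vocal tract) → Output speech wave
Source-filter model of speech production
• We know that s(n) = e(n) * h(n), where * denotes convolution, e(n) is the source excitation, h(n) is the impulse response of the vocal tract filter and s(n) is the output speech wave.
• The figure shows typical spectra of the two speech sounds of the
Hindi word “ki” on a log scale: red for /i/ and black for /k/.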
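A minimal Python sketch of the relation s(n) = e(n) * h(n) follows. The excitation is an impulse train standing in for glottal pulses, and the filter is a single decaying resonance. The sample rate, pitch period, decay factor and 500 Hz resonance are illustrative assumptions, not values from the slides.

```python
import math


def convolve(e, h):
    """Direct-form convolution: s(n) = sum over m of e(m) * h(n - m)."""
    s = [0.0] * (len(e) + len(h) - 1)
    for i, e_val in enumerate(e):
        if e_val == 0.0:
            continue  # skip silent samples of the excitation
        for j, h_val in enumerate(h):
            s[i + j] += e_val * h_val
    return s


SAMPLE_RATE = 8000   # Hz, assumed
PITCH_PERIOD = 80    # samples, i.e. a 100 Hz periodic excitation (assumed)

# e(n): periodic impulse train modelling the puffs of air from the glottis.
e = [1.0 if n % PITCH_PERIOD == 0 else 0.0 for n in range(160)]

# h(n): toy impulse response of one decaying resonance at 500 Hz.
h = [0.9 ** n * math.cos(2 * math.pi * 500 * n / SAMPLE_RATE)
     for n in range(64)]

s = convolve(e, h)  # output speech wave under this toy model
```

Because the impulse response here is shorter than the pitch period, each excitation pulse simply places one copy of h(n) in the output.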
9. Contd.
Speech sounds are characterized by the size and shape of the filter (vocal
cavity), which is represented by the spectrum of the filter, H(k).
Therefore, source characteristics such as fundamental frequency,
signal amplitude etc. can be ignored in speech recognition.
Since s(n) = e(n) * h(n), the spectra multiply: S(k) = E(k) H(k), where E(k) is the
spectrum of the source. The log power spectrum of the speech signal is therefore
the sum of the log power spectra of the source and the filter. The power spectrum
of the source varies rapidly with frequency, whereas that of the filter varies
slowly. Therefore, if we pass this composite log power spectrum through a
low-pass filter, only the characteristic of the filter remains.
This process is called liftering and can be achieved by just taking the
inverse Fourier transform of the log power spectrum and retaining the first few
components. The result is called the cepstrum and
its values are called cepstral coefficients.
$\mathrm{cep}(q) = \mathrm{IFFT}\{\log(|S(k)|^2)\}, \quad q = 0, 1, 2, \ldots, N-1$
Most SR systems use cepstral coefficients and their time
derivatives as features for representing speech sounds.
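The cepstrum formula can be sketched in plain Python as below. A direct O(N²) DFT is used in place of an FFT to keep the example dependency-free. The small `floor` value that guards against log(0) and the helper names are assumptions added for illustration.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(N^2), for illustration)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]


def cepstrum(frame, floor=1e-12):
    """cep(q) = IDFT{ log |S(k)|^2 } for one frame of samples."""
    log_power = [math.log(max(abs(s_k) ** 2, floor)) for s_k in dft(frame)]
    # The log power spectrum of a real frame is real and even,
    # so the imaginary parts of the result are numerical noise.
    return [c.real for c in idft(log_power)]


def lifter(cep, n_keep):
    """Keep only the first few coefficients (the slowly varying part)."""
    return cep[:n_keep]


# A scaled unit impulse has a flat spectrum, so all the energy of the
# cepstrum sits in cep(0) = log(4) and the rest is (numerically) zero.
cep = cepstrum([2.0] + [0.0] * 15)
features = lifter(cep, 4)
```

In practice a recognizer would apply this frame by frame and typically add further steps, which the slides do not describe.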
11. Contd.
First, the user gives a voice command over the microphone, which is
passed to the sound card in the system. This analog signal is
sampled and converted into digital form using a technique called Pulse
Code Modulation, or PCM. The resulting digital waveform is a stream of
amplitudes that looks like a wavy line when plotted.
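As a rough illustration of PCM, the sketch below samples a continuous signal at regular intervals and rounds each sample to a signed integer code. The 8000 Hz sample rate, 16-bit depth and 440 Hz test tone are assumptions chosen for the example; the slides do not specify them.

```python
import math


def pcm_encode(signal, n_samples, sample_rate=8000, bits=16):
    """Sample signal(t) and quantize each sample to a signed integer.

    `signal` is any function of time in seconds returning values
    nominally in [-1.0, 1.0]; values outside that range are clipped.
    """
    max_code = 2 ** (bits - 1) - 1
    codes = []
    for i in range(n_samples):
        x = signal(i / sample_rate)
        x = max(-1.0, min(1.0, x))         # clip to the valid range
        codes.append(round(x * max_code))  # quantize to an integer level
    return codes


def tone(t):
    """A 440 Hz sine wave standing in for the microphone signal."""
    return math.sin(2 * math.pi * 440 * t)


codes = pcm_encode(tone, n_samples=80)  # 10 ms of audio at 8000 Hz
```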
The digital signal is then split into short segments, and each segment is
converted into the frequency domain. So, the incoming stream is now a set of discrete
frequency bands, in a form that can be used by the speech recognizer.
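One way to picture the conversion into frequency bands is the sketch below, which cuts the sample stream into fixed-length frames and sums the spectral power falling in a few equal-width bands. The frame length, the number of bands and the use of a plain DFT are assumptions; real recognizers usually use overlapping, windowed frames.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Cut the sample stream into consecutive non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def band_energies(frame, n_bands):
    """Sum the spectral power of one frame into equal-width bands."""
    n = len(frame)
    half = n // 2  # bins from 0 Hz up to just below half the sample rate
    power = []
    for k in range(half):
        coeff = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        power.append(abs(coeff) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


SAMPLE_RATE = 8000  # Hz, assumed
test_tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
             for t in range(128)]
frame_list = split_frames(test_tone, 64)
energies = band_energies(frame_list[0], n_bands=4)
```

With these settings each band spans 1000 Hz, so the 1000 Hz test tone shows up as energy in the second band.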
The next stage involves recognizing these bands of frequencies.
For this, the speech recognition software has a database containing
thousands of reference frequency patterns for speech sounds, or "phonemes", as they’re called.
12. Contd.
A phoneme is the smallest unit of speech in a language. The
utterance (vocalization) of one phoneme is different from another, such
that if one phoneme replaces another in a word, the word would have
a different meaning. For example, if the "b" in "bat" were replaced by
the phoneme "r", the meaning would change to "rat".
Ex: Kit vs Skill. The /k/ is aspirated in the first word and not in the second, yet both are heard as the same phoneme /k/, so one phoneme can sound different depending on context.
The phoneme database is used to match the audio frequency
bands that were sampled. So, for example, if the incoming frequency
pattern sounds like a "t", the software will try to match it to the corresponding
phoneme in the database. Each phoneme is tagged with a feature
number, which is then assigned to the incoming signal.
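The matching step might look like the nearest-neighbour lookup sketched below. The slides do not say how the match is actually computed, so the Euclidean distance, the three-element feature vectors and every entry in `PHONEME_DB` are hypothetical values made up for illustration.

```python
import math

# Hypothetical toy database: phoneme -> (feature number, reference vector).
# The vectors are made-up placeholders, not measured speech data.
PHONEME_DB = {
    "t": (1, (0.9, 0.1, 0.3)),
    "b": (2, (0.2, 0.8, 0.5)),
    "r": (3, (0.4, 0.4, 0.9)),
}


def match_phoneme(features, database=PHONEME_DB):
    """Return (phoneme, feature number) of the closest reference vector."""
    best = min(database,
               key=lambda p: math.dist(features, database[p][1]))
    return best, database[best][0]


incoming = (0.85, 0.15, 0.3)  # hypothetical features of one incoming frame
phoneme, feature_number = match_phoneme(incoming)
```

Here the incoming vector lies closest to the "t" entry, so it is tagged with that phoneme's feature number.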
13. Why SR is difficult?
A given word is spoken by different people, and different people
have different spectral properties. Ex: women generally have a shorter vocal
tract than men, so the formant frequencies of a female speaker are
higher than those of a male speaker.
The properties of a sound depend not only on the identity of
the corresponding phoneme but also on the neighbouring
sounds. Ex: a speaker mispronounces the long word
“Thiruvananthapuram” as “tiruvanthpuram”. Human beings don’t
have any problem in mapping it to the correct word.
However, such cases pose a problem for a machine.
14. Current Software Options for PC
Dragon Systems – Naturally Speaking
Philips – FreeSpeech
IBM – ViaVoice
Lernout & Hauspie – Voice Xpress
15. Applications
Military: Of particular note are the U.S. programs in speech
recognition for the Advanced Fighter Technology Integration
(AFTI)/F-16 aircraft (F-16 VISTA) and the program in France on
installing speech recognition systems on Mirage aircraft. In
these programs, speech recognizers have been operated
successfully in fighter aircraft with applications including: setting
radio frequencies, commanding an autopilot system, setting
steer-point coordinates and weapons release parameters, and
controlling flight displays.
People with disabilities
Telephony and other domains