2. Course Objectives:
1. To enable the students to learn the fundamentals and
classification of speech sounds.
2. To enable the students to analyze and compare different
speech parameters using various methods.
3. To equip the students with various speech modelling
techniques.
4. To enable the students to acquire knowledge on various
speech recognition systems.
5. To gain knowledge about the various methods used for the
process of speech synthesis.
3. Course Outcomes:
After completion of the course, it is expected that:
The students will be able to
1. Explain the fundamentals and classification of speech sounds.
2. Analyse, extract and compare the various speech parameters.
3. Apply an appropriate speech model for a given application.
4. Explain the various speech recognition systems.
5. Apply different speech synthesis techniques depending upon
the classification of speech parameters.
4. UNIT I BASIC CONCEPTS
Speech Fundamentals: Articulatory Phonetics – Production and Classification of Speech
Sounds; Acoustic Phonetics – Acoustics of speech production; Review of Digital Signal
Processing concepts; Short-Time Fourier Transform, Filter-Bank and LPC Methods.
UNIT II SPEECH ANALYSIS
Features, Feature Extraction and Pattern Comparison Techniques: Speech distortion
measures– mathematical and perceptual – Log–Spectral Distance, Cepstral Distances,
Weighted Cepstral Distances and Filtering, Likelihood Distortions, Spectral Distortion using
a Warped Frequency Scale, LPC, PLP and MFCC Coefficients, Time Alignment and
Normalization – Dynamic Time Warping, Multiple Time – Alignment Paths.
UNIT III SPEECH MODELING
Hidden Markov Models: Markov Processes, HMMs – Evaluation, Optimal State Sequence –
Viterbi Search, Baum-Welch Parameter Re-estimation, Implementation issues.
UNIT IV SPEECH RECOGNITION
Large Vocabulary Continuous Speech Recognition: Architecture of a large vocabulary
continuous speech recognition system – acoustics and language models – n-grams, context
dependent sub-word units; Applications and present status.
UNIT V SPEECH SYNTHESIS
Text-to-Speech Synthesis: Concatenative and waveform synthesis methods, sub-word units
for TTS, intelligibility and naturalness – role of prosody, Applications and present status.
4
AAZHAGUJAISUDHANRITECE
5. INTRODUCTION
Speech processing is the study of speech signals and of the
methods used to process them.
Speech processing is the application of DSP techniques to the
processing and/or analysis of speech signals.
6. Speech is the most natural form of human-to-human
communication.
Speech is related to language; linguistics is a
branch of social science.
Speech is related to human physiological
capability; physiology is a branch of medical
science.
Speech is also related to sound and acoustics, a
branch of physical science.
Therefore, speech is one of the most intriguing
signals that humans work with every day.
9. Speech coding: Compression of speech signals for
telecommunication
Speech recognition: Extracting the linguistic content of
the speech signal
Speaker recognition: Recognizing the identity of
speakers by their voice
Speech synthesis: Computer generated speech
(e.g., from text)
Speech enhancement: Improving the intelligibility or
perceptual quality of the speech signal
10. APPLICATIONS
Translation of spoken language into text by
computers
Voice user interfaces such as voice dialing
(Call home)
Speech to text processing (Word processors or
emails)
Recognizing the speaker
14. Speech Generation
•The production process (generation) begins when the talker
formulates in his mind a message that he wants to transmit to
the listener via speech.
•In the case of a machine:
•First step: message formation in terms of printed text.
•Next step: conversion of the message into a language code.
•After the language code is chosen, the talker must execute a
series of neuromuscular commands to cause the vocal cords to
vibrate such that the proper sequence of speech sounds is created.
•The neuromuscular commands must simultaneously control the
movement of lips, jaw, tongue, and velum.
15. SPEECH PERCEPTION
Once the speech signal has been generated and propagated to
the listener, the speech perception (recognition) process
begins.
First, the listener processes the acoustic signal along
the basilar membrane in the inner ear, which
provides a running spectral analysis of the incoming
signal.
A neural transduction process converts the spectral
signal into activity signals on the auditory nerve.
Finally, message comprehension (understanding of the
meaning) is achieved.
16. SOUND PERCEPTION
The audible frequency range for humans is
approximately 20 Hz to 20 kHz.
The three distinct parts of the ear are the outer ear,
middle ear and inner ear.
Outer ear:
The perceived sound is sensitive to the pinna's shape.
Changing the pinna's shape alters the sound quality
as well as the background noise.
After passing through the ear canal, the sound wave strikes
the eardrum, which is part of the middle ear.
19. MIDDLE EAR
EAR DRUM
This membrane oscillates at the same frequency as the
sound wave.
Movements of this membrane are then transmitted
through a system of small bones called the ossicular
system, and from the ossicular system to the cochlea.
Inner ear
It consists of two membranes, Reissner's membrane and
the basilar membrane.
When vibrations enter the cochlea they stimulate 20,000 to
30,000 stiff hairs on the basilar membrane.
These hairs in turn vibrate and generate electrical signals
that travel to the brain, where they are perceived as sound.
20. PHONEME HIERARCHY
Speech sounds divide into vowels, diphthongs and consonants. The inventory is language dependent; there are about 50 phonemes in English.
Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
Diphthongs: ay, ey, oy, aw
Consonants:
– Plosive: p, b, t, d, k, g
– Nasal: m, n, ng
– Fricative: f, v, th, dh, s, z, sh, zh, h
– Retroflex liquid: r
– Lateral liquid: l
– Glide: w, y
24. VOWELS
Vowels are produced by exciting an essentially fixed
vocal tract shape with quasi periodic pulses of air caused
by the vibration of the vocal cords.
A vowel is a speech sound produced when the breath
flows out through the mouth without being blocked by
the teeth, tongue, or lips.
A short vowel is a short sound as in the word "cup"
A long vowel is a long sound as in the word "shoe"
27. WHY ARE VOWELS EASILY DECODABLE?
Vowels are generally long in duration compared to consonants.
They are spectrally well defined.
Vowels are easily and reliably recognized by both
humans and machines.
Vowels can be subdivided into three subgroups
based on whether the tongue hump lies along the front,
central or back part of the palate.
28. VOWELS
For the vowel /i/ – eve, beat – the vocal tract is
open at the back, the tongue is raised at the front,
and there is a high degree of constriction of the
tongue against the palate.
For the vowel /a/ – father, bob – the vocal tract is
open at the front, the tongue is raised at the back,
and there is a low degree of constriction by the
tongue against the palate.
29. i – IY – beat, eve
I – IH – bit
e – EH – bet, hate
30. a – AA – Bob
ə – AH – but
31. u – UW – boot
U – UH – book
O – OW – boat
33. DIPHTHONGS
A diphthong is a gliding monosyllabic speech sound
that starts at or near the articulatory position for
one vowel and moves to or toward the position for
another.
By this definition, there are six diphthongs in
American English.
Examples: buy, boy, down, bait
34. DIPHTHONGS
A vowel sound in which the tongue changes position to
produce the sound of two vowels
A sound formed by the combination of two vowels in a
single syllable
35. SEMIVOWELS
The group of sounds consisting of /w/ (wit), /l/ (let)
and /r/ (rent) is quite difficult to characterize.
These sounds are called semivowels because of
their vowel-like nature.
They are characterized by a gliding transition in the vocal
tract area function between adjacent phonemes.
36. LIQUIDS
A liquid is a consonant produced when the tongue approaches a
point of articulation within the mouth but does not come close
enough to obstruct or constrict the flow of air enough to create
turbulence (as with fricatives).
The primary difference between liquids and glides is that a
liquid is produced with the tip of the tongue, whereas a glide
raises the body of the tongue, not the tip.
/l/ - L - Let
/r/ - R - Rent
37. GLIDES
To glide is to move easily, without stopping and without effort or noise.
A glide, like a liquid, is a consonant produced when the tongue
approaches a point of articulation within the mouth but does not
come close enough to obstruct or constrict the flow of air enough to
create turbulence.
Unlike nasals, the flow of air is not redirected into the nose. Instead,
as with liquids, the air is still allowed to escape via the mouth.
/w/ - W - Wit
/y/ - Y - Yet
38. CONSONANTS
A consonant is one of the speech sounds or letters of the alphabet
that is not a vowel.
Consonants are pronounced by stopping the air from flowing
easily through the mouth, especially by closing the lips or
touching the teeth with the tongue.
A nasal consonant is one in which the air escapes only through the
nose.
In English, "m" and "n" are nasal consonants.
In "hat", "h" and "t" are consonants.
m (me), n (no), ng (sing)
39. NASAL CONSONANTS
A nasal is a consonant produced by redirecting air out
through the nose instead of allowing it to escape through the mouth.
The nasal consonants /m/ (EM, as in "bottom") and /n/ (EN, as in
"button") are produced with glottal excitation and the vocal tract
totally constricted at some point along the oral passageway.
The velum is lowered so that air flows through the nasal tract, with
sound being radiated through the nostrils.
/m/ - constriction at the lips
/n/ - constriction just behind the teeth
42. UNVOICED FRICATIVES
Produced by exciting the vocal tract by a steady air
flow
Becomes turbulent in the region of a constriction in
the vocal tract
Location of the constriction determines the fricative
sound
/f/-Constriction is near the lips
/θ/- Constriction is near the teeth
/s/-Constriction is near the middle of the oral tract
/sh/-Constriction is near the back of the oral tract
The vocal tract is separated into two cavities by the
source of noise at the constriction.
43. VOICED FRICATIVES
/v/,/z/ and /zh/ are some of the examples of voiced
fricatives
The place of constriction for each corresponding phoneme is
essentially identical to that of its unvoiced counterpart.
The vocal cords vibrate, so the glottis provides a second
excitation source in addition to the noise at the constriction.
E.g.: vat (/v/), azure (/zh/)
44. STOPS/PLOSIVES
Produced by completely stopping the air flow
Airstream cannot escape through the mouth
46. VOICED STOPS
These are transient, non continuant sounds
produced by building up pressure behind a total
constriction somewhere in the oral tract and then
suddenly releasing the pressure
/b/- Constriction is at the lips
/d/- Constriction is at the back of the teeth
/g/- Constriction is near the velum
During the period of total constriction, no sound is radiated from the lips.
The vocal cords vibrate.
Their properties are highly influenced by the
vowel that follows the stop consonant.
47. UNVOICED STOPS
/p/,/t/ and /k/ are some examples
The vocal cords do not vibrate
49. WHISPERS
The vocal cords do not vibrate.
Air passes between the arytenoid cartilages to create
audible turbulence during speech.
Whispering is used to convey information without being
overheard, or to avoid disturbing others in a quiet place
such as a library or place of worship.
51. APPROACHES TO AUTOMATIC
SPEECH RECOGNITION BY
MACHINE
There are three approaches
The acoustic-phonetic approach
The pattern recognition approach
The artificial intelligence approach
53. SEGMENTATION AND LABELLING
Segmenting the speech signal into discrete regions
depending on the acoustic properties of the signal
Attaching one or more phonetic labels to each
segmented region
The second step attempts to determine a valid word from
the sequence of phonetic labels.
The problem is to decode the phoneme lattice into a
word string such that every instant of time is
included in one of the phonemes in the lattice.
54. One phoneme can be pronounced in different ways;
a phone group containing similar variants
of a single phoneme is therefore called an allophone.
The symbol SIL denotes silence.
SIL – AO – L – AX – B – AW – T: "all about"
For the lattice structure, refer to page 38.
L, AX and B correspond to the second and
third choices in the lattice.
55. PROBLEMS IN ACOUSTIC PHONETIC
APPROACH
The method requires extensive knowledge of the
acoustic properties of phonetic units
For most systems the choice of features is based on
intuition and is not optimal in a well defined and
meaningful sense
The design of sound classifiers is also not optimal
No well-defined, automatic procedure exists for tuning
the method on real, labeled speech.
56. PATTERN RECOGNITION APPROACH
Speech patterns are used directly without explicit
feature determination and segmentation.
Step one: training of speech patterns
Step two: recognition of pattern via pattern
comparison
57. PATTERN RECOGNITION APPROACH
Speech knowledge is brought into the system via
the training procedure.
Enough versions of each pattern to be recognized are
included in a training set provided to the algorithm.
The machine learns which acoustic properties of the
speech class are reliable and repeatable across all
training tokens of the pattern.
58. ADVANTAGES
Simplicity of use – the mathematical representation is easy.
Robustness and invariance to different speech
vocabularies, users, feature sets, pattern comparison
algorithms and decision rules.
Proven high performance
59. ARTIFICIAL INTELLIGENCE
APPROACH
It is a hybrid of the acoustic-phonetic and pattern-recognition
approaches.
This approach models the recognition procedure on the
way a person applies intelligence in visualizing,
analyzing and finally making a decision on the
measured acoustic features.
Neural network - For learning the relationship
between phonetic events and all known inputs as well
as for discrimination between similar sound classes.
61. SPEECH ANALYSIS SYSTEM
It provides an appropriate spectral representation
of the time-varying speech signal.
A commonly used technique is the linear predictive
coding (LPC) method.
62. FEATURE DETECTION STAGE
Convert the spectral measurements to a set of features
that describe the acoustic properties of the different
phonetic units.
Features
Nasality: presence or absence of nasal resonance
Friction: presence or absence of random excitation in the
speech
Formant locations: frequencies of the first three resonances
Voiced and unvoiced classification: periodic and aperiodic
excitation
63. SEGMENTATION AND LABELLING
PHASE
The system tries to find stable regions
To label the segmented region according to how well the
features within that region match those of individual
phonetic units
This stage is the heart of the acoustic-phonetic
recognizer and is the most difficult one to carry out
reliably.
Various control strategies are used to limit the range of
segmentation points and label possibilities.
The final output of the recognizer is the word or word
sequence, in some well-defined sense.
64. VOWEL CLASSIFIER
Formants are the bands of frequency that determine the
phonetic quality of a vowel.
Compact sounds have a concentration of energy in the
middle of the frequency range of the spectrum. An
example is the vowel /ɑ/, which has a relatively high
first formant that is close to the frequency of the
second formant.
The opposite of compact is diffuse. A diffuse vowel,
such as i, has no centrally located concentration of
energy – the first and second formants are widely
separated.
66. ACUTE AND GRAVE
"acute" typically refers to front vowels
Grave typically refers to back vowels
66
AAZHAGUJAISUDHANRITECE
67. Three features are measured over the segment: the
first formant, F1, the second formant, F2, and the duration of
the segment, D.
The first test separates vowels with low F1 from
vowels with high F1.
Each of these subsets can be split further on the basis
of F2 measurement with high F2 and low F2.
The third test is based on segment duration, which
separates tense vowels (large value of D) from lax
vowels (small values of D).
Finally, a finer test on formant values separates the
remaining unresolved vowels into flat vowels and plain vowels.
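This sequence of tests can be sketched as a small decision tree. The code below is a minimal illustration, not the classifier from the slides: the thresholds (in Hz and ms) and the group labels are invented for demonstration.

```python
# Minimal sketch of the F1/F2/duration decision tree described above.
# All thresholds and group labels are illustrative assumptions.

def classify_vowel(f1: float, f2: float, duration_ms: float) -> str:
    f1_group = "high-F1" if f1 > 550 else "low-F1"        # test 1: first formant
    f2_group = "high-F2" if f2 > 1500 else "low-F2"       # test 2: second formant
    tenseness = "tense" if duration_ms > 150 else "lax"   # test 3: duration
    return f"{f1_group} / {f2_group} / {tenseness}"

print(classify_vowel(f1=300, f2=2300, duration_ms=200))  # /i/-like vowel
print(classify_vowel(f1=750, f2=1100, duration_ms=200))  # /a/-like vowel
```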
70. Feature measurement: a sequence of measurements is made
on the input signal to define the "test pattern".
The feature measurements are usually the output of a spectral analysis
technique, such as a filter-bank analyzer, LPC analysis, or DFT analysis.
Pattern training: creates a reference pattern for each sound
class, called a template.
Pattern classification: the unknown test pattern is compared with the
reference pattern of each sound class, and a measure of the distance between
the test pattern and each reference pattern is computed.
Decision logic: the reference pattern similarity (distance) scores are
used to decide which reference pattern best matches the unknown
test pattern.
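As a rough sketch of the classification and decision-logic steps, the fragment below compares a test pattern against stored templates with a Euclidean distance; the template values and class names are hypothetical, and a real recognizer would use time-aligned sequences rather than single vectors.

```python
import numpy as np

# Hypothetical reference patterns (templates), one per sound class.
templates = {
    "class_a": np.array([1.0, 0.2, 0.7]),
    "class_b": np.array([0.1, 0.9, 0.4]),
}

def classify(test_pattern: np.ndarray) -> str:
    # Pattern classification: distance between test pattern and each template.
    distances = {name: np.linalg.norm(test_pattern - ref)
                 for name, ref in templates.items()}
    # Decision logic: pick the best-matching (minimum-distance) reference.
    return min(distances, key=distances.get)

print(classify(np.array([0.9, 0.3, 0.6])))   # -> class_a
```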
71. STRENGTHS AND WEAKNESS OF
THE PATTERN-RECOGNITION
MODEL
The performance of the system is sensitive to the
amount of training data available for creating the sound
class reference patterns (more training, higher
performance).
The reference patterns are sensitive to the speaking
environment and to the transmission characteristics of the
medium used to create the speech (because speech
spectral characteristics are affected by transmission
and background noise).
72. The method is relatively insensitive to syntax and
semantics.
The system is insensitive to the sound class, so
the techniques can be applied to a wide range of
speech sounds (phrases).
73. AI APPROACHES TO SPEECH
RECOGNITION
The basic idea of the AI approach is to compile and incorporate
knowledge from a variety of knowledge sources to solve
the problem.
Acoustic Knowledge: Knowledge related to sound or
sense of hearing
Lexical Knowledge: Knowledge of the words of the
language. (decomposing words into sounds)
Syntactic Knowledge: Knowledge of syntax (rules)
Semantic Knowledge: Knowledge of the meaning of the
language.
Pragmatic Knowledge: (sense derived from meaning)
inference ability necessary in resolving ambiguity of
meaning based on ways in which words are generally
used.
74. SEVERAL WAYS TO INTEGRATE
KNOWLEDGE SOURCES WITHIN A
SPEECH RECOGNIZER
1. Bottom-Up
2. Top-Down
3. Black Board
75. BOTTOM-UP APPROACH:
The lowest level processes (feature detection,
phonetic decoding) precede higher level processes
(lexical coding) in a sequential manner.
76. TOP-DOWN APPROACH:
Here the language model generates word hypotheses that
are matched against the speech signal, and syntactically
and semantically meaningful sentences are built up on
the basis of word match scores.
78. Introduction
• Spectral analysis is the process of representing the speech signal in terms of parameters for further processing
• E.g., short-term energy, zero-crossing rate, level-crossing rate, and so on
• Methods for spectral analysis are therefore considered the core of the signal-processing front end in a speech recognition system
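To make two of the parameters named above concrete, here is a minimal sketch of short-term energy and zero-crossing rate, computed on one synthetic 10 ms frame; the sampling rate and test frequency are arbitrary choices.

```python
import numpy as np

def short_term_energy(frame: np.ndarray) -> float:
    # Sum of squared samples over the frame
    return float(np.sum(frame.astype(float) ** 2))

def zero_crossing_rate(frame: np.ndarray) -> float:
    # Fraction of adjacent sample pairs whose signs differ
    return float(np.mean(np.sign(frame[:-1]) != np.sign(frame[1:])))

fs = 8000                                  # assumed sampling rate
t = np.arange(fs // 100) / fs              # one 10 ms frame
frame = np.sin(2 * np.pi * 200 * t)        # synthetic 200 Hz "voiced" frame
print(short_term_energy(frame), zero_crossing_rate(frame))
```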
81. Pattern Recognition Model
• The three basic steps in the pattern recognition model are:
– 1. parameter measurement
– 2. pattern comparison
– 3. decision making
82. Parameter Measurement
• To represent the relevant acoustic events in the speech signal in terms of a compact, efficient set of speech parameters
• The choice of which parameters to use is dictated by other considerations, e.g.:
– computational efficiency
– type of implementation
– available memory
• The way in which the representation is computed is based on signal processing considerations
84. Spectral Analysis
• Two methods:
– The filter-bank spectrum
– Linear predictive coding (LPC)
85. The Filter-Bank Spectrum
[Block diagram: digital input passes through a bank of band-pass filters to give the spectral representation]
The band-pass filters' coverage spans the frequency range of interest in the signal.
86. The Bank-of-Filters Front-End Processor
• One of the most common approaches for processing the speech signal is the bank-of-filters model
• This method takes a speech signal as input and passes it through a set of filters in order to obtain the spectral representation of each frequency band of interest
87. Examples:
• 100–3000 Hz for a telephone-quality signal
• 100–8000 Hz for a broadband signal
• The individual filters generally do overlap in frequency
• The output of the ith band-pass filter is the short-time spectral representation $X_n(e^{j\omega_i})$, where $\omega_i$ is the normalized frequency of the ith channel
88. Each band-pass filter processes the speech signal independently to produce the spectral representation $X_n$.
89. The Bank-of-Filters Front-End Processor
[Figure: block diagram of the bank-of-filters front-end processor]
90. The Bank-of-Filters Front-End Processor
The sampled speech signal, $s(n)$, is passed through a bank of $Q$ band-pass filters, giving the signals
$$s_i(n) = s(n) * h_i(n) = \sum_{m=0}^{M_i-1} h_i(m)\, s(n-m), \qquad 1 \le i \le Q$$
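A minimal sketch of this equation in code, assuming FIR band-pass filters designed with scipy; the sampling rate, number of channels and band edges below are arbitrary choices, not values from the slides.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 8000                                   # assumed sampling rate, Hz
Q = 4                                       # assumed number of channels
edges = np.linspace(300, 3400, Q + 1)       # assumed band edges (telephone band)

# One FIR band-pass impulse response h_i(n) per channel
filters = [firwin(101, [edges[i], edges[i + 1]], pass_zero=False, fs=fs)
           for i in range(Q)]

s = np.random.randn(fs)                     # stand-in for the sampled speech s(n)

# s_i(n) = s(n) * h_i(n): each channel filters the signal independently
channel_outputs = [lfilter(h, 1.0, s) for h in filters]
```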
91. The Bank-of-Filters Front-End Processor
The bank-of-filters approach obtains the energy values of the speech signal through the following steps:
• Signal enhancement and noise elimination – to make the speech signal more evident to the bank of filters.
• Set of band-pass filters – separate the signal into frequency bands (uniform or non-uniform filters).
92. • Nonlinearity – the filtered signal in every band is passed through a nonlinear function (for example a full-wave or half-wave rectifier) to shift the band-pass spectrum to the low-frequency band.
93. The Bank-of-Filters Front-End Processor
• Low-pass filter – eliminates the high-frequency components generated by the nonlinear function.
• Sampling rate reduction and amplitude compression – the resulting signals are represented more economically by re-sampling at a reduced rate and compressing the signal's dynamic range.
The role of the final low-pass filter is to eliminate the undesired spectral peaks.
94. The Bank-of-Filters Front-End Processor
Assume that the output of the ith band-pass filter is a pure sinusoid at frequency $\omega_i$:
$$s_i(n) = \alpha_i \sin(\omega_i n)$$
If a full-wave rectifier is used as the nonlinearity, then
$$f(s_i(n)) = \begin{cases} s_i(n), & s_i(n) \ge 0 \\ -s_i(n), & s_i(n) < 0 \end{cases}$$
The nonlinearity output is
$$v_i(n) = f(s_i(n)) = s_i(n)\, w(n), \qquad w(n) = \begin{cases} +1, & s_i(n) \ge 0 \\ -1, & s_i(n) < 0 \end{cases}$$
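A quick numerical sketch of the rectifier-plus-low-pass stage, assuming one channel output that is a pure sinusoid; the cutoff frequency, filter order and signal parameters are arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 8000
n = np.arange(fs)
s_i = 0.5 * np.sin(2 * np.pi * 1000 * n / fs)   # assumed channel output s_i(n)

v_i = np.abs(s_i)                 # full-wave rectifier: v_i(n) = |s_i(n)|

# The low-pass filter removes the harmonics introduced by the rectifier,
# leaving the slowly varying energy envelope of the channel.
b, a = butter(4, 60, fs=fs)       # assumed 4th-order filter, 60 Hz cutoff
envelope = lfilter(b, a, v_i)
```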
95. Types of Filter Bank Used for Speech Recognition
• Uniform filter bank
• Non-uniform filter bank
96. Uniform Filter Bank
• The most common filter bank is the uniform filter bank
• The center frequency, $f_i$, of the ith band-pass filter is defined as
$$f_i = \frac{F_s}{N}\, i, \qquad 1 \le i \le Q$$
where $F_s$ is the sampling rate of the speech signal, $N$ is the number of uniformly spaced filters required to span the frequency range of the speech, and $Q$ is the number of filters used in the bank.
97. Uniform Filter Bank
• The actual number of filters used in the filter bank satisfies $Q \le N/2$
• $b_i$ is the bandwidth of the ith filter
• There should not be any frequency overlap between adjacent filter channels
98. Uniform Filter Bank
If $b_i < F_s/N$, then certain portions of the speech spectrum would be missing from the analysis and the resulting speech spectrum would not be considered very meaningful.
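The relations above can be checked with a few lines of code; the values of Fs, N and Q below are arbitrary but respect Q ≤ N/2 and b_i = Fs/N.

```python
Fs = 8000    # assumed sampling rate, Hz
N = 32       # uniformly spaced filters needed to span the full range
Q = 12       # filters actually used (Q <= N/2)

bandwidth = Fs / N                                       # b_i = Fs / N
center_freqs = [(Fs / N) * i for i in range(1, Q + 1)]   # f_i = (Fs/N) * i
print(bandwidth, center_freqs[:4])   # 250.0 [250.0, 500.0, 750.0, 1000.0]
```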
99. Non-uniform Filter Bank
• An alternative to the uniform filter bank is the non-uniform filter bank
• The criterion is to space the filters uniformly along a logarithmic frequency scale
• For a set of $Q$ band-pass filters with center frequencies $f_i$ and bandwidths $b_i$, $1 \le i \le Q$, the bandwidths are chosen to grow by a fixed factor from one filter to the next
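A sketch of such a logarithmic spacing, assuming each bandwidth grows by a fixed factor α and adjacent filters touch; the values of C, α, f1 and Q are invented for illustration.

```python
C = 100.0     # assumed bandwidth of the first filter, Hz
alpha = 1.5   # assumed logarithmic growth factor
f1 = 150.0    # assumed center frequency of the first filter, Hz
Q = 8

bandwidths = [C * alpha ** i for i in range(Q)]
centers = [f1]
for i in range(1, Q):
    # adjacent filters touch: spacing is the mean of neighbouring bandwidths
    centers.append(centers[-1] + (bandwidths[i - 1] + bandwidths[i]) / 2)
print([round(f) for f in centers])
```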
101. Implementations of Filter Banks
• Depending on the method of design, the filter bank can be implemented in various ways
• Design methods for digital filters fall into two classes:
– Infinite impulse response (IIR), or recursive, filters
– Finite impulse response (FIR), or non-recursive, filters
102. The FIR (Finite Impulse Response), or Non-Recursive, Filter
• The present output depends on the present input sample and previous input samples
• The impulse response is restricted to a finite number of samples
103. • Advantages:
– Stable; less severely affected by noise
– Excellent design methods are available for various kinds of FIR filters
– Phase response is linear
• Disadvantages:
– Costly to implement
– Memory requirements and execution time are high
– Requires powerful computational facilities
104. The IIR (Infinite Impulse Response), or Recursive, Filter
• The present output sample depends on the present input, past input samples and past output samples
• The impulse response extends over an infinite duration
105. • Advantages:
– Simple to design
– Efficient
• Disadvantages:
– Phase response is nonlinear
– More strongly affected by noise
– Can be unstable
107. FIR Filters
• A less expensive implementation can be derived by representing each band-pass filter as a fixed low-pass window $w(n)$ modulated by a complex exponential:
$$h_i(n) = w(n)\, e^{j\omega_i n}$$
$$x_i(n) = s(n) * h_i(n) = \sum_m s(m)\, w(n-m)\, e^{j\omega_i (n-m)} = e^{j\omega_i n} \sum_m s(m)\, w(n-m)\, e^{-j\omega_i m} = e^{j\omega_i n}\, S_n(e^{j\omega_i})$$
where $S_n(e^{j\omega_i})$ is the Fourier transform of the windowed signal evaluated at $\omega_i = 2\pi f_i$.
110. Frequency-Domain Interpretation of the Short-Time Fourier Transform
$$S_n(e^{j\omega_i}) = \sum_m s(m)\, w(n-m)\, e^{-j\omega_i m}$$
At $n = n_0$:
$$S_{n_0}(e^{j\omega_i}) = \mathrm{FT}\big[s(m)\, w(n_0-m)\big]\Big|_{\omega = \omega_i}$$
where $\mathrm{FT}[\cdot]$ denotes the Fourier transform. $S_{n_0}(e^{j\omega_i})$ is the conventional Fourier transform of the windowed signal, $s(m)w(n_0-m)$, evaluated at the frequency $\omega = \omega_i$.
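The defining sum can be evaluated directly. The sketch below computes $S_{n_0}(e^{j\omega_i})$ for a synthetic sinusoid; the window length, frequencies and analysis time $n_0$ are arbitrary choices.

```python
import numpy as np

def stft_at(s: np.ndarray, w: np.ndarray, n0: int, omega: float) -> complex:
    """S_n0(e^{j*omega}) = sum_m s(m) w(n0 - m) e^{-j*omega*m}."""
    m = np.arange(len(s))
    shifted = np.zeros(len(s))
    valid = (n0 - m >= 0) & (n0 - m < len(w))    # w(n0 - m) is zero elsewhere
    shifted[valid] = w[(n0 - m)[valid]]
    return complex(np.sum(s * shifted * np.exp(-1j * omega * m)))

fs = 8000
s = np.sin(2 * np.pi * 500 * np.arange(fs) / fs)
w = np.hamming(200)
omega_i = 2 * np.pi * 500 / fs        # normalized frequency of the analysis bin
print(abs(stft_at(s, w, n0=400, omega=omega_i)))
```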
111. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure: shows which parts of s(m) are used in the computation of the short-time Fourier transform]
112. Frequency-Domain Interpretation of the Short-Time Fourier Transform
• Since $w(m)$ is an FIR filter of length $L$, from the definition of $S_n(e^{j\omega_i})$ we can state that:
– If $L$ is large relative to the signal periodicity, then $S_n(e^{j\omega_i})$ gives good frequency resolution
– If $L$ is small relative to the signal periodicity, then $S_n(e^{j\omega_i})$ gives poor frequency resolution
113. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] A 500-point Hamming window is applied to a section of voiced speech. The periodicity of the signal is seen in the windowed time waveform as well as in the short-time spectrum, in which the fundamental frequency and its harmonics show up as narrow peaks at equally spaced frequencies.
114. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] For short windows, the time sequence s(m)w(n-m) does not show the signal periodicity, nor does the signal spectrum; it does show the broad spectral envelope very well.
115. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] Shows an irregular series of local peaks and valleys due to the random nature of unvoiced speech.
116. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] Using the shorter window smooths out the random fluctuations in the short-time spectral magnitude and shows the broad spectral envelope very well.
117. Linear Filtering Interpretation of the Short-Time Fourier Transform
$$S_n(e^{j\omega_i}) = \big[s(n)\, e^{-j\omega_i n}\big] * w(n)$$
i.e., $S_n(e^{j\omega_i})$ is a convolution of the low-pass window, $w(n)$, with the speech signal, $s(n)$, modulated to the center frequency $\omega_i$.
120. Summary of Considerations for Speech Recognition Filter Banks
1st: The type of digital filter used (IIR (recursive) or FIR (non-recursive))
• IIR: Advantage: simple to implement and efficient. Disadvantage: phase response is nonlinear
• FIR: Advantage: phase response is linear. Disadvantage: expensive to implement
121. Summary of Considerations for Speech Recognition Filter Banks
2nd: The number of filters, Q, to be used in the filter bank
1. For uniform filter banks, Q cannot be too small, or else the ability of the filter bank to resolve the speech spectrum is greatly damaged; values of Q less than 8 are generally avoided
2. Q cannot be too large either, because the filter bandwidths would eventually be too narrow for some talkers (e.g., high-pitched female voices), i.e., no prominent harmonic would fall within the band (in practical systems, Q ≤ 32)
122. Summary of Considerations for Speech Recognition Filter Banks
In order to reduce overall computation, many practical systems have used non-uniformly spaced filter banks.
123. Summary of Considerations for Speech Recognition Filter Banks
3rd: The choice of nonlinearity and LPF used at the output of each channel
• Nonlinearity: full-wave or half-wave rectifier
• LPF: varies from a simple integrator to a good-quality IIR low-pass filter
126. LINEAR PREDICTIVE CODING MODEL
FOR SPEECH RECOGNITION
LPC provides a good model of the speech signal.
Voiced region – good approximation
Unvoiced region - less effective than for voiced region
LPC is an analytically tractable model. The method of
LPC is mathematically precise and is simple and
straightforward to implement in either software or
hardware.
The computation required for LPC processing is
considerably less than that required for an all-digital
implementation of the bank-of-filters model.
The LPC model works well in recognition applications.
127. 3.3.1 The LPC Model
$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p)$$
Convert this to an equality by including an excitation term:
$$s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + G\, u(n)$$
In the z-domain:
$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G\, U(z)$$
so the transfer function is
$$H(z) = \frac{S(z)}{G\, U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} = \frac{1}{A(z)}$$
128. 3.3.2 LPC Analysis Equations
$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n)$$
The linear predictor:
$$\tilde{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$$
The prediction error:
$$e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k)$$
Error transfer function:
$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k}$$
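The prediction error can be obtained by inverse filtering $s(n)$ through $A(z)$. The sketch below synthesizes an AR(2) signal and recovers its excitation; the coefficient values are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([1.3, -0.6])           # assumed predictor coefficients a_1, a_2
A = np.concatenate(([1.0], -a))     # A(z) = 1 - a_1 z^{-1} - a_2 z^{-2}

# Synthesize s(n) by passing unit-variance noise through 1/A(z) ...
s = lfilter([1.0], A, np.random.randn(5000))
# ... then inverse-filter with A(z) to recover the prediction error e(n)
e = lfilter(A, [1.0], s)
print(np.var(e))                    # approximately 1.0: the excitation variance
```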
129. 3.3 Linear Predictive Coding Model for Speech Recognition
[Figure: LPC synthesis model, in which the excitation u(n), scaled by the gain G, drives the all-pole filter 1/A(z) to produce the speech signal s(n)]
130. Linear Prediction Model (using LP analysis):
[Figure: discrete-time speech production model. A DT impulse generator (voiced, controlled by the pitch) and a white-noise generator (unvoiced) feed a V/U switch; the selected excitation, scaled by a gain estimate, drives a time-varying digital filter whose coefficients are the vocal tract parameters, producing the speech signal s(n)]
131. The basic problem of linear prediction analysis is to determine the set of predictor coefficients $a_k$.
Spectral characteristics of speech vary over time, so the predictor coefficients at a given time $n$ must be estimated from a short segment of the speech signal.
Short-time spectral analysis is performed on successive frames of speech, with frame spacing on the order of 10 ms.
132. 3.3.2 LPC Analysis Equations
Define the speech segment and error at time $n$:
$$s_n(m) = s(n+m), \qquad e_n(m) = e(n+m)$$
We seek to minimize the mean squared error signal:
$$E_n = \sum_m e_n^2(m) = \sum_m \Big[ s_n(m) - \sum_{k=1}^{p} a_k\, s_n(m-k) \Big]^2$$
134. 3.3.2 LPC Analysis Equations
The minimum mean-squared error can be expressed as:
$$\hat{E}_n = \sum_m s_n^2(m) - \sum_{k=1}^{p} \hat{a}_k \sum_m s_n(m)\, s_n(m-k) = \phi_n(0,0) - \sum_{k=1}^{p} \hat{a}_k\, \phi_n(0,k)$$
135. 3.3.3 The Autocorrelation Method
Window the segment:
$$s_n(m) = \begin{cases} s(n+m)\, w(m), & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
where $w(m)$ is a window that is zero outside $0 \le m \le N-1$.
The mean squared error is:
$$E_n = \sum_{m=0}^{N+p-1} e_n^2(m)$$
And:
$$\phi_n(i,k) = \sum_{m=0}^{N+p-1} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p, \; 0 \le k \le p$$
$$\phi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k), \qquad 1 \le i \le p, \; 0 \le k \le p$$
136. 3.3.3 The Autocorrelation Method
Since $\phi_n(i,k)$ is only a function of $i-k$, the covariance function reduces to the simple autocorrelation function:
$$\phi_n(i,k) = r_n(i-k)$$
137. 3.3.3 The Autocorrelation Method
Since the autocorrelation function is symmetric, i.e. $r_n(-k) = r_n(k)$, the LPC equations become
$$\sum_{k=1}^{p} \hat{a}_k\, r_n(|i-k|) = r_n(i), \qquad 1 \le i \le p$$
and can be expressed in matrix form as:
$$\begin{bmatrix} r_n(0) & r_n(1) & r_n(2) & \cdots & r_n(p-1) \\ r_n(1) & r_n(0) & r_n(1) & \cdots & r_n(p-2) \\ r_n(2) & r_n(1) & r_n(0) & \cdots & r_n(p-3) \\ \vdots & \vdots & \vdots & & \vdots \\ r_n(p-1) & r_n(p-2) & r_n(p-3) & \cdots & r_n(0) \end{bmatrix} \begin{bmatrix} \hat{a}_1 \\ \hat{a}_2 \\ \hat{a}_3 \\ \vdots \\ \hat{a}_p \end{bmatrix} = \begin{bmatrix} r_n(1) \\ r_n(2) \\ r_n(3) \\ \vdots \\ r_n(p) \end{bmatrix}$$
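Because the matrix is Toeplitz, the system can be solved efficiently, e.g. by the Levinson-Durbin recursion. Below is a minimal sketch using scipy's Toeplitz solver on a synthetic frame; the window choice, order p and test signal are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorrelation(frame: np.ndarray, p: int) -> np.ndarray:
    """Solve sum_k a_k r(|i-k|) = r(i), 1 <= i <= p (autocorrelation method)."""
    frame = frame * np.hamming(len(frame))            # w(m): analysis window
    r = np.correlate(frame, frame, mode="full")
    r = r[len(frame) - 1:len(frame) + p]              # r(0) ... r(p)
    # solve_toeplitz exploits the Toeplitz structure (Levinson recursion)
    return solve_toeplitz((r[:p], r[:p]), r[1:p + 1])

fs = 8000
n = np.arange(240)                                    # one 30 ms frame
frame = np.sin(2 * np.pi * 300 * n / fs) + 0.01 * np.random.randn(240)
print(lpc_autocorrelation(frame, p=4))                # predictor coefficients
```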
141. 3.3.4 The Covariance Method
Change the interval for computing the error to $0 \le m \le N-1$ and use the unweighted speech directly:
$$E_n = \sum_{m=0}^{N-1} e_n^2(m)$$
with $\phi_n(i,k)$ defined as:
$$\phi_n(i,k) = \sum_{m=0}^{N-1} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p, \; 0 \le k \le p$$
or, by a change of variables:
$$\phi_n(i,k) = \sum_{m=-i}^{N-i-1} s_n(m)\, s_n(m+i-k)$$
142. 3.3.4 The Covariance Method
In matrix form:
$$\begin{bmatrix} \phi_n(1,1) & \phi_n(1,2) & \phi_n(1,3) & \cdots & \phi_n(1,p) \\ \phi_n(2,1) & \phi_n(2,2) & \phi_n(2,3) & \cdots & \phi_n(2,p) \\ \phi_n(3,1) & \phi_n(3,2) & \phi_n(3,3) & \cdots & \phi_n(3,p) \\ \vdots & \vdots & \vdots & & \vdots \\ \phi_n(p,1) & \phi_n(p,2) & \phi_n(p,3) & \cdots & \phi_n(p,p) \end{bmatrix} \begin{bmatrix} \hat{a}_1 \\ \hat{a}_2 \\ \hat{a}_3 \\ \vdots \\ \hat{a}_p \end{bmatrix} = \begin{bmatrix} \phi_n(1,0) \\ \phi_n(2,0) \\ \phi_n(3,0) \\ \vdots \\ \phi_n(p,0) \end{bmatrix}$$
The resulting covariance matrix is symmetric, but not Toeplitz, and can be solved efficiently by a set of techniques called Cholesky decomposition.
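A minimal sketch of the covariance method with a Cholesky solve; it sums φ(i,k) over the interior of the frame so that no samples outside the buffer are needed, and the order p and test signal are assumptions.

```python
import numpy as np

def lpc_covariance(s: np.ndarray, p: int) -> np.ndarray:
    """Build phi(i,k) over the unweighted frame and solve the symmetric
    (non-Toeplitz) normal equations via Cholesky factorization."""
    N = len(s)
    phi = np.empty((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            # phi(i,k) = sum_m s(m-i) s(m-k), summed over m = p .. N-1
            phi[i, k] = np.dot(s[p - i:N - i], s[p - k:N - k])
    L = np.linalg.cholesky(phi[1:, 1:])    # symmetric positive-definite matrix
    y = np.linalg.solve(L, phi[1:, 0])     # forward substitution
    return np.linalg.solve(L.T, y)         # back substitution

fs = 8000
n = np.arange(240)
frame = np.sin(2 * np.pi * 300 * n / fs) + 0.01 * np.random.randn(240)
print(lpc_covariance(frame, p=4))
```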
143. 3.3.6 Examples of LPC Analysis
[Figure: examples of LPC analysis]
144. REFERENCES
TEXTBOOKS:
1. Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", Pearson Education, 2003.
2. Daniel Jurafsky and James H. Martin, "Speech and Language Processing – An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition", Pearson Education, 2002.
3. Frederick Jelinek, "Statistical Methods of Speech Recognition", MIT Press, 1997.
REFERENCES:
1. Steven W. Smith, "The Scientist and Engineer's Guide to Digital Signal Processing", California Technical Publishing, 1997.
2. Thomas F. Quatieri, "Discrete-Time Speech Signal Processing – Principles and Practice", Pearson Education, 2004.
3. Claudio Becchetti and Lucio Prina Ricotti, "Speech Recognition", John Wiley and Sons, 1999.
4. Ben Gold and Nelson Morgan, "Speech and Audio Signal Processing: Processing and Perception of Speech and Music", Wiley-India Edition, 2006.
Editor's Notes
On board: presentation of the source-filter model.
Here the bandwidth of the filters is not the same; it keeps increasing logarithmically. For uniform filters, the bandwidth that each individual filter spans is the same, hence the name uniform.
Recursion: an expression such that each term is generated by repeating a particular mathematical operation.
Time-frequency analysis plays a central role in signal analysis. It has long been recognized that a global Fourier transform of a long time signal is of little practical value for analyzing the frequency spectrum of the signal. Transient signals that evolve in time in an unpredictable way (like a speech signal or an EEG signal) necessitate a notion of frequency analysis that is local in time.
In many applications, such as speech processing, we are interested in the frequency content of a signal locally in time; that is, the signal parameters (frequency content, etc.) evolve over time. Such signals are called non-stationary. For a non-stationary signal, x(t), the standard Fourier transform is not useful for analyzing the signal. Information that is localized in time, such as spikes and high-frequency bursts, cannot be easily detected from the Fourier transform.
Time-localization can be achieved by first windowing the signal so as to cut out only a well-localized slice of x(t) and then taking its Fourier transform. This gives rise to the short-time Fourier transform (STFT), or windowed Fourier transform. The magnitude of the STFT is called the spectrogram. By restricting to a discrete range of frequencies and times we can obtain an orthogonal basis of functions.