2. Course Objectives:
1. To enable the students to learn the fundamentals and
classification of speech sounds.
2. To enable the students to analyze and compare different
speech parameters using various methods.
3. To equip the students with various speech modelling
techniques.
4. To enable the students to acquire knowledge on various
speech recognition systems.
5. To gain knowledge about the various methods used for the
process of speech synthesis.
3. Course Outcomes:
After completion of the course, it is expected that:
The students will be able to
1. Explain the fundamentals and classification of speech sounds.
2. Analyse, extract and compare the various speech parameters.
3. Apply an appropriate speech model for a given application.
4. Explain the various speech recognition systems.
5. Apply different speech synthesis techniques depending upon
the classification of speech parameters.
4. UNIT I BASIC CONCEPTS
Speech Fundamentals: Articulatory Phonetics – Production and Classification of Speech
Sounds; Acoustic Phonetics – Acoustics of speech production; Review of Digital Signal
Processing concepts; Short-Time Fourier Transform, Filter-Bank and LPC Methods.
UNIT II SPEECH ANALYSIS
Features, Feature Extraction and Pattern Comparison Techniques: Speech distortion
measures– mathematical and perceptual – Log–Spectral Distance, Cepstral Distances,
Weighted Cepstral Distances and Filtering, Likelihood Distortions, Spectral Distortion using
a Warped Frequency Scale, LPC, PLP and MFCC Coefficients, Time Alignment and
Normalization – Dynamic Time Warping, Multiple Time – Alignment Paths.
UNIT III SPEECH MODELING
Hidden Markov Models: Markov Processes, HMMs – Evaluation, Optimal State Sequence –
Viterbi Search, Baum-Welch Parameter Re-estimation, Implementation issues.
UNIT IV SPEECH RECOGNITION
Large Vocabulary Continuous Speech Recognition: Architecture of a large vocabulary
continuous speech recognition system – acoustics and language models – n-grams, context
dependent sub-word units; Applications and present status.
UNIT V SPEECH SYNTHESIS
Text-to-Speech Synthesis: Concatenative and waveform synthesis methods, sub-word units
for TTS, intelligibility and naturalness – role of prosody, Applications and present status.
4
AAZHAGUJAISUDHANRITECE
5. INTRODUCTION
Speech processing is the study of speech signals and of the
methods used to process them.
Speech processing is the application of DSP techniques to the
processing and/or analysis of speech signals.
6. Speech is the most natural form of human-to-human
communication.
Speech is related to language; linguistics is a
branch of social science.
Speech is related to human physiological
capability; physiology is a branch of medical
science.
Speech is also related to sound and acoustics, a
branch of physical science.
Therefore, speech is one of the most intriguing
signals that humans work with every day.
9. Speech coding: Compression of speech signals for
telecommunication
Speech recognition: Extracting the linguistic content of
the speech signal
Speaker recognition: Recognizing the identity of
speakers by their voice
Speech synthesis: Computer generated speech
(e.g., from text)
Speech enhancement: Improving the intelligibility or
perceptual quality of the speech signal
10. APPLICATIONS
Translation of spoken language into text by
computers
Voice user interfaces such as voice dialing
(Call home)
Speech to text processing (Word processors or
emails)
Recognizing the speaker
14. Speech Generation
•The production process (generation) begins when the talker
formulates in his mind a message that he wants to transmit to
the listener via speech.
•In the case of a machine:
•First step: message formation in terms of printed text.
•Next step: conversion of the message into a language code.
•After the language code is chosen, the talker must execute a
series of neuromuscular commands to cause the vocal cords to
vibrate such that the proper sequence of speech sounds is created.
•The neuromuscular commands must simultaneously control the
movement of lips, jaw, tongue, and velum.
15. SPEECH PERCEPTION
Once the speech signal has been generated and propagated to
the listener, the speech perception (recognition) process
begins.
First, the listener processes the acoustic signal along
the basilar membrane in the inner ear, which
provides a running spectral analysis of the incoming
signal.
A neural transduction process converts the spectral
signal into activity signals on the auditory nerve.
Finally, message comprehension (understanding of the
meaning) is achieved.
16. SOUND PERCEPTION
The audible frequency range for humans is
approximately 20 Hz to 20 kHz.
The three distinct parts of the ear are the outer ear,
middle ear and inner ear.
Outer ear:
The perceived sound is sensitive to the pinna's shape.
Changing the pinna's shape alters the sound quality
as well as the background noise.
After passing through the ear canal, the sound wave strikes
the eardrum, which is part of the middle ear.
19. MIDDLE EAR
EAR DRUM
This membrane oscillates at the same frequency as the
sound wave.
Movements of this membrane are then transmitted
through a system of small bones called the ossicular
system, and from the ossicular system to the cochlea.
Inner ear
It consists of two membranes, Reissner's membrane and
the basilar membrane.
When vibrations enter the cochlea they stimulate 20,000 to
30,000 stiff hairs on the basilar membrane.
These hairs in turn vibrate and generate electrical signals
that travel to the brain, where they are perceived as sound.
20. PHONEME HIERARCHY
Speech sounds divide into vowels, diphthongs and consonants. The inventory is language dependent; there are about 50 phonemes in English.
Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
Diphthongs: ay, ey, oy, aw
Consonants:
– Plosive: p, b, t, d, k, g
– Nasal: m, n, ng
– Fricative: f, v, th, dh, s, z, sh, zh, h
– Retroflex liquid: r
– Lateral liquid: l
– Glide: w, y
24. VOWELS
Vowels are produced by exciting an essentially fixed
vocal tract shape with quasi periodic pulses of air caused
by the vibration of the vocal cords.
A vowel is a speech sound produced when the breath
flows out through the mouth without being blocked by
the teeth, tongue, or lips.
A short vowel is a short sound as in the word "cup"
A long vowel is a long sound as in the word "shoe"
27. WHY ARE VOWELS EASILY DECODABLE?
Vowels are generally long in duration compared to consonants.
They are spectrally well defined.
Vowels are easily and reliably recognized by both
humans and machines.
Vowels can be subdivided into three subgroups
based on whether the tongue hump lies along the front,
central or back part of the palate.
28. VOWELS
For the vowel /i/ – eve, beat – the vocal tract is
open at the back, the tongue is raised at the front,
and there is a high degree of constriction of the
tongue against the palate.
For the vowel /a/ – father, bob – the vocal tract is
open at the front, the tongue is raised at the back,
and there is a low degree of constriction by the
tongue against the palate.
29. i – IY – beat, eve
I – IH – bit
e – EH – bet, hate
30. a – AA – Bob
ə – AH – but
31. u – UW – boot
U – UH – book
O – OW – boat
33. DIPHTHONGS
A diphthong is a gliding monosyllabic speech sound
that starts at or near the articulatory position for
one vowel and moves to or toward the position for
another.
By this definition, there are six diphthongs in
American English.
Examples: buy, boy, down, bait
34. DIPHTHONGS
A vowel sound in which the tongue changes position to
produce the sound of two vowels
A sound formed by the combination of two vowels in a
single syllable
35. SEMIVOWELS
The group of sounds consisting of /w/ (wit), /l/ (let)
and /r/ (rent) is quite difficult to characterize.
These sounds are called semivowels because of
their vowel-like nature.
They are characterized by a gliding transition in the vocal
tract area function between adjacent phonemes.
36. LIQUIDS
A liquid is a consonant produced when the tongue approaches a
point of articulation within the mouth but does not come close
enough to obstruct or constrict the flow of air enough to create
turbulence (as with fricatives).
The primary difference between liquids and glides is that a
liquid is produced with the tip of the tongue, whereas a glide
raises the body of the tongue, not the tip.
/l/ - L - Let
/r/ - R - Rent
37. GLIDES
To glide is to move easily, without stopping and without effort or noise.
A glide, like a liquid, is a consonant produced when the tongue
approaches a point of articulation within the mouth but does not
come close enough to obstruct or constrict the flow of air enough to
create turbulence.
Unlike nasals, the flow of air is not redirected into the nose. Instead,
as with liquids, the air is still allowed to escape via the mouth.
/w/ - W - Wit
/y/ - Y - Yet
38. CONSONANTS
A consonant is one of the speech sounds or letters of the alphabet
that is not a vowel.
Consonants are pronounced by stopping the air from flowing
easily through the mouth, especially by closing the lips or
touching the teeth with the tongue.
A nasal consonant is one in which the air escapes only through the
nose.
In English, "m" and "n" are nasal consonants.
In "hat", "h" and "t" are consonants.
m (me), n (no), ng (sing)
39. NASAL CONSONANTS
A nasal is a consonant produced by redirecting air out
through the nose instead of allowing it to escape through the mouth.
The nasal consonants /m/ (EM, as in "bottom") and /n/ (EN, as in
"button") are produced with glottal excitation and the vocal tract
totally constricted at some point along the oral passageway.
The velum is lowered so that air flows through the nasal tract, with
sound being radiated through the nostrils.
/m/ - constriction at the lips
/n/ - constriction just behind the teeth
42. UNVOICED FRICATIVES
Produced by exciting the vocal tract by a steady air
flow
Becomes turbulent in the region of a constriction in
the vocal tract
Location of the constriction determines the fricative
sound
/f/-Constriction is near the lips
/θ/- Constriction is near the teeth
/s/-Constriction is near the middle of the oral tract
/sh/-Constriction is near the back of the oral tract
The vocal tract is separated into two cavities by the
source of noise at the constriction.
43. VOICED FRICATIVES
/v/,/z/ and /zh/ are some of the examples of voiced
fricatives
The place of constriction for each corresponding phoneme is
essentially identical to that of its unvoiced counterpart.
The vocal cords vibrate, so the glottis provides a second
excitation source in addition to the noise at the constriction.
E.g.: vat (/v/), azure (/zh/)
44. STOPS/PLOSIVES
Produced by completely stopping the air flow
Airstream cannot escape through the mouth
46. VOICED STOPS
These are transient, non continuant sounds
produced by building up pressure behind a total
constriction somewhere in the oral tract and then
suddenly releasing the pressure
/b/- Constriction is at the lips
/d/- Constriction is at the back of the teeth
/g/- Constriction is near the velum
During the period of total constriction, no sound is radiated from the lips.
The vocal cords vibrate.
Their properties are highly influenced by the
vowel that follows the stop consonant.
47. UNVOICED STOPS
/p/,/t/ and /k/ are some examples
The vocal cords do not vibrate
49. WHISPERS
The vocal cords do not vibrate.
Air passes between the arytenoid cartilages to create
audible turbulence during speech.
Whispering is used to convey information without being
overheard, or to avoid disturbing others in a quiet place
such as a library or place of worship.
51. APPROACHES TO AUTOMATIC
SPEECH RECOGNITION BY
MACHINE
There are three approaches
The acoustic-phonetic approach
The pattern recognition approach
The artificial intelligence approach
53. SEGMENTATION AND LABELLING
Segmenting the speech signal into discrete regions
depending on the acoustic properties of the signal
Attaching one or more phonetic labels to each
segmented region
The second step attempts to determine a valid word from
the sequence of phonetic labels.
The problem is to decode the phoneme lattice into a
word string such that every instant of time is
included in one of the phonemes in the lattice.
54. One phoneme can be pronounced in different ways;
a phone group containing similar variants
of a single phoneme is therefore called an allophone.
The symbol SIL denotes silence.
SIL – AO – L – AX – B – AW – T: "all about"
For the lattice structure, refer to page 38.
L, AX and B correspond to the second and
third choices in the lattice.
55. PROBLEMS IN ACOUSTIC PHONETIC
APPROACH
The method requires extensive knowledge of the
acoustic properties of phonetic units
For most systems the choice of features is based on
intuition and is not optimal in a well defined and
meaningful sense
The design of sound classifiers is also not optimal
No well-defined, automatic procedure exists for tuning
the method on real, labeled speech.
56. PATTERN RECOGNITION APPROACH
Speech patterns are used directly without explicit
feature determination and segmentation.
Step one: training of speech patterns
Step two: recognition of pattern via pattern
comparison
57. PATTERN RECOGNITION APPROACH
Speech knowledge is brought into the system via
the training procedure.
Enough versions of each pattern to be recognized are
included in a training set provided to the algorithm.
The machine learns which acoustic properties of the
speech class are reliable and repeatable across all
training tokens of the pattern.
58. ADVANTAGES
Simplicity of use – the mathematical representation is easy.
Robustness and invariance to different speech
vocabularies, users, feature sets, pattern comparison
algorithms and decision rules.
Proven high performance
59. ARTIFICIAL INTELLIGENCE
APPROACH
It is a hybrid of the acoustic-phonetic and pattern-recognition
approaches.
This approach models the recognition procedure on the
way a person applies intelligence in visualizing,
analyzing and finally making a decision on the
measured acoustic features.
Neural network - For learning the relationship
between phonetic events and all known inputs as well
as for discrimination between similar sound classes.
61. SPEECH ANALYSIS SYSTEM
It provides an appropriate spectral representation
of the time-varying speech signal.
A commonly used technique is the linear predictive
coding (LPC) method.
62. FEATURE DETECTION STAGE
Convert the spectral measurements to a set of features
that describe the acoustic properties of the different
phonetic units.
Features
Nasality: presence or absence of nasal resonance
Friction: presence or absence of random excitation in the
speech
Formant locations: frequencies of the first three resonances
Voiced and unvoiced classification: periodic and aperiodic
excitation
63. SEGMENTATION AND LABELLING
PHASE
The system tries to find stable regions
To label the segmented region according to how well the
features within that region match those of individual
phonetic units
This stage is the heart of the acoustic-phonetic
recognizer and is the most difficult one to carry out
reliably.
Various control strategies are used to limit the range of
segmentation points and label possibilities.
The final output of the recognizer is the word or word
sequence, in some well-defined sense.
64. VOWEL CLASSIFIER
Formants are the bands of frequency that determine the
phonetic quality of a vowel.
Compact sounds have a concentration of energy in the
middle of the frequency range of the spectrum. An
example is the vowel /ɑ/, which has a relatively high
first formant that is close to the frequency of the
second formant.
The opposite of compact is diffuse. A diffuse vowel,
such as i, has no centrally located concentration of
energy – the first and second formants are widely
separated.
66. ACUTE AND GRAVE
"acute" typically refers to front vowels
Grave typically refers to back vowels
66
AAZHAGUJAISUDHANRITECE
67. Three features are measured over the segment: the
first formant, F1, the second formant, F2, and the duration of
the segment, D.
The first test separates vowels with low F1 from
vowels with high F1.
Each of these subsets can be split further on the basis
of F2 measurement with high F2 and low F2.
The third test is based on segment duration, which
separates tense vowels (large value of D) from lax
vowels (small values of D).
Finally, a finer test on formant values separates the
remaining unresolved vowels into flat vowels and plain vowels.
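This sequence of tests can be sketched as a small decision tree. The code below is a minimal illustration, not the classifier from the slides: the thresholds (in Hz and ms) and the group labels are invented for demonstration.

```python
# Minimal sketch of the F1/F2/duration decision tree described above.
# All thresholds and group labels are illustrative assumptions.

def classify_vowel(f1: float, f2: float, duration_ms: float) -> str:
    f1_group = "high-F1" if f1 > 550 else "low-F1"        # test 1: first formant
    f2_group = "high-F2" if f2 > 1500 else "low-F2"       # test 2: second formant
    tenseness = "tense" if duration_ms > 150 else "lax"   # test 3: duration
    return f"{f1_group} / {f2_group} / {tenseness}"

print(classify_vowel(f1=300, f2=2300, duration_ms=200))  # /i/-like vowel
print(classify_vowel(f1=750, f2=1100, duration_ms=200))  # /a/-like vowel
```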
70. Feature measurement: a sequence of measurements is made
on the input signal to define the "test pattern".
The feature measurements are usually the output of a spectral analysis
technique, such as a filter-bank analyzer, LPC analysis, or DFT analysis.
Pattern training: creates a reference pattern for each sound
class, called a template.
Pattern classification: the unknown test pattern is compared with the
reference pattern of each sound class, and a measure of the distance between
the test pattern and each reference pattern is computed.
Decision logic: the reference pattern similarity (distance) scores are
used to decide which reference pattern best matches the unknown
test pattern.
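As a rough sketch of the classification and decision-logic steps, the fragment below compares a test pattern against stored templates with a Euclidean distance; the template values and class names are hypothetical, and a real recognizer would use time-aligned sequences rather than single vectors.

```python
import numpy as np

# Hypothetical reference patterns (templates), one per sound class.
templates = {
    "class_a": np.array([1.0, 0.2, 0.7]),
    "class_b": np.array([0.1, 0.9, 0.4]),
}

def classify(test_pattern: np.ndarray) -> str:
    # Pattern classification: distance between test pattern and each template.
    distances = {name: np.linalg.norm(test_pattern - ref)
                 for name, ref in templates.items()}
    # Decision logic: pick the best-matching (minimum-distance) reference.
    return min(distances, key=distances.get)

print(classify(np.array([0.9, 0.3, 0.6])))   # -> class_a
```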
71. STRENGTHS AND WEAKNESS OF
THE PATTERN-RECOGNITION
MODEL
The performance of the system is sensitive to the
amount of training data available for creating the sound
class reference patterns (more training, higher
performance).
The reference patterns are sensitive to the speaking
environment and to the transmission characteristics of the
medium used to create the speech (because speech
spectral characteristics are affected by transmission
and background noise).
72. The method is relatively insensitive to syntax and
semantics.
The system is insensitive to the sound class, so
the techniques can be applied to a wide range of
speech sounds (phrases).
73. AI APPROACHES TO SPEECH
RECOGNITION
The basic idea of the AI approach is to compile and incorporate
knowledge from a variety of knowledge sources to solve
the problem.
Acoustic Knowledge: Knowledge related to sound or
sense of hearing
Lexical Knowledge: Knowledge of the words of the
language. (decomposing words into sounds)
Syntactic Knowledge: Knowledge of syntax (rules)
Semantic Knowledge: Knowledge of the meaning of the
language.
Pragmatic Knowledge: (sense derived from meaning)
inference ability necessary in resolving ambiguity of
meaning based on ways in which words are generally
used.
74. SEVERAL WAYS TO INTEGRATE
KNOWLEDGE SOURCES WITHIN A
SPEECH RECOGNIZER
1. Bottom-Up
2. Top-Down
3. Black Board
75. BOTTOM-UP APPROACH:
The lowest level processes (feature detection,
phonetic decoding) precede higher level processes
(lexical coding) in a sequential manner.
76. TOP-DOWN APPROACH:
Here the language model generates word hypotheses that
are matched against the speech signal, and syntactically
and semantically meaningful sentences are built up on
the basis of word match scores.
78. Introduction
• Spectral analysis is the process of representing the speech signal in terms of parameters for further processing
• E.g., short-term energy, zero-crossing rate, level-crossing rate, and so on
• Methods for spectral analysis are therefore considered the core of the signal-processing front end in a speech recognition system
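To make two of the parameters named above concrete, here is a minimal sketch of short-term energy and zero-crossing rate, computed on one synthetic 10 ms frame; the sampling rate and test frequency are arbitrary choices.

```python
import numpy as np

def short_term_energy(frame: np.ndarray) -> float:
    # Sum of squared samples over the frame
    return float(np.sum(frame.astype(float) ** 2))

def zero_crossing_rate(frame: np.ndarray) -> float:
    # Fraction of adjacent sample pairs whose signs differ
    return float(np.mean(np.sign(frame[:-1]) != np.sign(frame[1:])))

fs = 8000                                  # assumed sampling rate
t = np.arange(fs // 100) / fs              # one 10 ms frame
frame = np.sin(2 * np.pi * 200 * t)        # synthetic 200 Hz "voiced" frame
print(short_term_energy(frame), zero_crossing_rate(frame))
```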
81. Pattern Recognition Model
• The three basic steps in the pattern recognition model are:
– 1. parameter measurement
– 2. pattern comparison
– 3. decision making
82. Parameter Measurement
• To represent the relevant acoustic events in the speech signal in terms of a compact, efficient set of speech parameters
• The choice of which parameters to use is dictated by other considerations, e.g.:
– computational efficiency
– type of implementation
– available memory
• The way in which the representation is computed is based on signal processing considerations
84. Spectral Analysis
• Two methods:
– The filter-bank spectrum
– Linear predictive coding (LPC)
85. The Filter-Bank Spectrum
[Block diagram: digital input passes through a bank of band-pass filters to give the spectral representation]
The band-pass filters' coverage spans the frequency range of interest in the signal.
86. The Bank-of-Filters Front-End Processor
• One of the most common approaches for processing the speech signal is the bank-of-filters model
• This method takes a speech signal as input and passes it through a set of filters in order to obtain the spectral representation of each frequency band of interest
87. Examples:
• 100–3000 Hz for a telephone-quality signal
• 100–8000 Hz for a broadband signal
• The individual filters generally do overlap in frequency
• The output of the ith band-pass filter is the short-time spectral representation $X_n(e^{j\omega_i})$, where $\omega_i$ is the normalized frequency of the ith channel
88. Each band-pass filter processes the speech signal independently to produce the spectral representation $X_n$.
89. The Bank-of-Filters Front-End Processor
[Figure: block diagram of the bank-of-filters front-end processor]
90. The Bank-of-Filters Front-End Processor
The sampled speech signal, $s(n)$, is passed through a bank of $Q$ band-pass filters, giving the signals
$$s_i(n) = s(n) * h_i(n) = \sum_{m=0}^{M_i-1} h_i(m)\, s(n-m), \qquad 1 \le i \le Q$$
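A minimal sketch of this equation in code, assuming FIR band-pass filters designed with scipy; the sampling rate, number of channels and band edges below are arbitrary choices, not values from the slides.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 8000                                   # assumed sampling rate, Hz
Q = 4                                       # assumed number of channels
edges = np.linspace(300, 3400, Q + 1)       # assumed band edges (telephone band)

# One FIR band-pass impulse response h_i(n) per channel
filters = [firwin(101, [edges[i], edges[i + 1]], pass_zero=False, fs=fs)
           for i in range(Q)]

s = np.random.randn(fs)                     # stand-in for the sampled speech s(n)

# s_i(n) = s(n) * h_i(n): each channel filters the signal independently
channel_outputs = [lfilter(h, 1.0, s) for h in filters]
```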
91. The Bank-of-Filters Front-End Processor
The bank-of-filters approach obtains the energy values of the speech signal through the following steps:
• Signal enhancement and noise elimination – to make the speech signal more evident to the bank of filters.
• Set of band-pass filters – separate the signal into frequency bands (uniform or non-uniform filters).
92. • Nonlinearity – the filtered signal in every band is passed through a nonlinear function (for example a full-wave or half-wave rectifier) to shift the band-pass spectrum to the low-frequency band.
93. The Bank-of-Filters Front-End Processor
• Low-pass filter – eliminates the high-frequency components generated by the nonlinear function.
• Sampling rate reduction and amplitude compression – the resulting signals are represented more economically by re-sampling at a reduced rate and compressing the signal's dynamic range.
The role of the final low-pass filter is to eliminate the undesired spectral peaks.
94. The Bank-of-Filters Front-End Processor
Assume that the output of the ith band-pass filter is a pure sinusoid at frequency $\omega_i$:
$$s_i(n) = \alpha_i \sin(\omega_i n)$$
If a full-wave rectifier is used as the nonlinearity, then
$$f(s_i(n)) = \begin{cases} s_i(n), & s_i(n) \ge 0 \\ -s_i(n), & s_i(n) < 0 \end{cases}$$
The nonlinearity output is
$$v_i(n) = f(s_i(n)) = s_i(n)\, w(n), \qquad w(n) = \begin{cases} +1, & s_i(n) \ge 0 \\ -1, & s_i(n) < 0 \end{cases}$$
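A quick numerical sketch of the rectifier-plus-low-pass stage, assuming one channel output that is a pure sinusoid; the cutoff frequency, filter order and signal parameters are arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 8000
n = np.arange(fs)
s_i = 0.5 * np.sin(2 * np.pi * 1000 * n / fs)   # assumed channel output s_i(n)

v_i = np.abs(s_i)                 # full-wave rectifier: v_i(n) = |s_i(n)|

# The low-pass filter removes the harmonics introduced by the rectifier,
# leaving the slowly varying energy envelope of the channel.
b, a = butter(4, 60, fs=fs)       # assumed 4th-order filter, 60 Hz cutoff
envelope = lfilter(b, a, v_i)
```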
95. Types of Filter Bank Used for Speech Recognition
• Uniform filter bank
• Non-uniform filter bank
96. Uniform Filter Bank
• The most common filter bank is the uniform filter bank
• The center frequency, $f_i$, of the ith band-pass filter is defined as
$$f_i = \frac{F_s}{N}\, i, \qquad 1 \le i \le Q$$
where $F_s$ is the sampling rate of the speech signal, $N$ is the number of uniformly spaced filters required to span the frequency range of the speech, and $Q$ is the number of filters used in the bank.
97. Uniform Filter Bank
• The actual number of filters used in the filter bank satisfies $Q \le N/2$
• $b_i$ is the bandwidth of the ith filter
• There should not be any frequency overlap between adjacent filter channels
98. Uniform Filter Bank
If $b_i < F_s/N$, then certain portions of the speech spectrum would be missing from the analysis and the resulting speech spectrum would not be considered very meaningful.
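The relations above can be checked with a few lines of code; the values of Fs, N and Q below are arbitrary but respect Q ≤ N/2 and b_i = Fs/N.

```python
Fs = 8000    # assumed sampling rate, Hz
N = 32       # uniformly spaced filters needed to span the full range
Q = 12       # filters actually used (Q <= N/2)

bandwidth = Fs / N                                       # b_i = Fs / N
center_freqs = [(Fs / N) * i for i in range(1, Q + 1)]   # f_i = (Fs/N) * i
print(bandwidth, center_freqs[:4])   # 250.0 [250.0, 500.0, 750.0, 1000.0]
```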
99. Non-uniform Filter Bank
• An alternative to the uniform filter bank is the non-uniform filter bank
• The criterion is to space the filters uniformly along a logarithmic frequency scale
• For a set of $Q$ band-pass filters with center frequencies $f_i$ and bandwidths $b_i$, $1 \le i \le Q$, the bandwidths are chosen to grow by a fixed factor from one filter to the next
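A sketch of such a logarithmic spacing, assuming each bandwidth grows by a fixed factor α and adjacent filters touch; the values of C, α, f1 and Q are invented for illustration.

```python
C = 100.0     # assumed bandwidth of the first filter, Hz
alpha = 1.5   # assumed logarithmic growth factor
f1 = 150.0    # assumed center frequency of the first filter, Hz
Q = 8

bandwidths = [C * alpha ** i for i in range(Q)]
centers = [f1]
for i in range(1, Q):
    # adjacent filters touch: spacing is the mean of neighbouring bandwidths
    centers.append(centers[-1] + (bandwidths[i - 1] + bandwidths[i]) / 2)
print([round(f) for f in centers])
```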
101. Implementations of Filter Banks
• Depending on the method of design, the filter bank can be implemented in various ways
• Design methods for digital filters fall into two classes:
– Infinite impulse response (IIR), or recursive, filters
– Finite impulse response (FIR), or non-recursive, filters
102. The FIR (Finite Impulse Response), or Non-Recursive, Filter
• The present output depends on the present input sample and previous input samples
• The impulse response is restricted to a finite number of samples
103. • Advantages:
– Stable; less severely affected by noise
– Excellent design methods are available for various kinds of FIR filters
– Phase response is linear
• Disadvantages:
– Costly to implement
– Memory requirements and execution time are high
– Requires powerful computational facilities
104. The IIR (Infinite Impulse Response), or Recursive, Filter
• The present output sample depends on the present input, past input samples and past output samples
• The impulse response extends over an infinite duration
105. • Advantages:
– Simple to design
– Efficient
• Disadvantages:
– Phase response is nonlinear
– More strongly affected by noise
– Can be unstable
107. FIR Filters
• A less expensive implementation can be derived by representing each band-pass filter as a fixed low-pass window $w(n)$ modulated by a complex exponential:
$$h_i(n) = w(n)\, e^{j\omega_i n}$$
$$x_i(n) = s(n) * h_i(n) = \sum_m s(m)\, w(n-m)\, e^{j\omega_i (n-m)} = e^{j\omega_i n} \sum_m s(m)\, w(n-m)\, e^{-j\omega_i m} = e^{j\omega_i n}\, S_n(e^{j\omega_i})$$
where $S_n(e^{j\omega_i})$ is the Fourier transform of the windowed signal evaluated at $\omega_i = 2\pi f_i$.
110. Frequency-Domain Interpretation of the Short-Time Fourier Transform
$$S_n(e^{j\omega_i}) = \sum_m s(m)\, w(n-m)\, e^{-j\omega_i m}$$
At $n = n_0$:
$$S_{n_0}(e^{j\omega_i}) = \mathrm{FT}\big[s(m)\, w(n_0-m)\big]\Big|_{\omega = \omega_i}$$
where $\mathrm{FT}[\cdot]$ denotes the Fourier transform. $S_{n_0}(e^{j\omega_i})$ is the conventional Fourier transform of the windowed signal, $s(m)w(n_0-m)$, evaluated at the frequency $\omega = \omega_i$.
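The defining sum can be evaluated directly. The sketch below computes $S_{n_0}(e^{j\omega_i})$ for a synthetic sinusoid; the window length, frequencies and analysis time $n_0$ are arbitrary choices.

```python
import numpy as np

def stft_at(s: np.ndarray, w: np.ndarray, n0: int, omega: float) -> complex:
    """S_n0(e^{j*omega}) = sum_m s(m) w(n0 - m) e^{-j*omega*m}."""
    m = np.arange(len(s))
    shifted = np.zeros(len(s))
    valid = (n0 - m >= 0) & (n0 - m < len(w))    # w(n0 - m) is zero elsewhere
    shifted[valid] = w[(n0 - m)[valid]]
    return complex(np.sum(s * shifted * np.exp(-1j * omega * m)))

fs = 8000
s = np.sin(2 * np.pi * 500 * np.arange(fs) / fs)
w = np.hamming(200)
omega_i = 2 * np.pi * 500 / fs        # normalized frequency of the analysis bin
print(abs(stft_at(s, w, n0=400, omega=omega_i)))
```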
111. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure: shows which parts of s(m) are used in the computation of the short-time Fourier transform]
112. Frequency-Domain Interpretation of the Short-Time Fourier Transform
• Since $w(m)$ is an FIR filter of length $L$, from the definition of $S_n(e^{j\omega_i})$ we can state that:
– If $L$ is large relative to the signal periodicity, then $S_n(e^{j\omega_i})$ gives good frequency resolution
– If $L$ is small relative to the signal periodicity, then $S_n(e^{j\omega_i})$ gives poor frequency resolution
113. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] A 500-point Hamming window is applied to a section of voiced speech. The periodicity of the signal is seen in the windowed time waveform as well as in the short-time spectrum, in which the fundamental frequency and its harmonics show up as narrow peaks at equally spaced frequencies.
114. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] For short windows, the time sequence s(m)w(n-m) does not show the signal periodicity, nor does the signal spectrum; it does show the broad spectral envelope very well.
115. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] Shows an irregular series of local peaks and valleys due to the random nature of unvoiced speech.
116. Frequency-Domain Interpretation of the Short-Time Fourier Transform
[Figure] Using the shorter window smooths out the random fluctuations in the short-time spectral magnitude and shows the broad spectral envelope very well.
117. Linear Filtering Interpretation of the Short-Time Fourier Transform
$$S_n(e^{j\omega_i}) = \big[s(n)\, e^{-j\omega_i n}\big] * w(n)$$
i.e., $S_n(e^{j\omega_i})$ is a convolution of the low-pass window, $w(n)$, with the speech signal, $s(n)$, modulated to the center frequency $\omega_i$.
120. Summary of Considerations for Speech Recognition Filter Banks
1st: The type of digital filter used (IIR (recursive) or FIR (non-recursive))
• IIR: Advantage: simple to implement and efficient. Disadvantage: phase response is nonlinear
• FIR: Advantage: phase response is linear. Disadvantage: expensive to implement
121. Summary of Considerations for Speech Recognition Filter Banks
2nd: The number of filters, Q, to be used in the filter bank
1. For uniform filter banks, Q cannot be too small, or else the ability of the filter bank to resolve the speech spectrum is greatly damaged; values of Q less than 8 are generally avoided
2. Q cannot be too large either, because the filter bandwidths would eventually be too narrow for some talkers (e.g., high-pitched female voices), i.e., no prominent harmonic would fall within the band (in practical systems, Q ≤ 32)
122. Summary of Considerations for Speech Recognition Filter Banks
In order to reduce overall computation, many practical systems have used non-uniformly spaced filter banks.
123. Summary of Considerations for Speech Recognition Filter Banks
3rd: The choice of nonlinearity and LPF used at the output of each channel
• Nonlinearity: full-wave or half-wave rectifier
• LPF: varies from a simple integrator to a good-quality IIR low-pass filter
126. LINEAR PREDICTIVE CODING MODEL
FOR SPEECH RECOGNITION
LPC provides a good model of the speech signal.
Voiced region – good approximation
Unvoiced region - less effective than for voiced region
LPC is an analytically tractable model. The method of
LPC is mathematically precise and is simple and
straightforward to implement in either software or
hardware.
The computation required for LPC processing is
considerably less than that required for an all-digital
implementation of the bank-of-filters model.
The LPC model works well in recognition applications.
127. 3.3.1 The LPC Model
$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p)$$
Convert this to an equality by including an excitation term:
$$s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + G\, u(n)$$
In the z-domain:
$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G\, U(z)$$
so the transfer function is
$$H(z) = \frac{S(z)}{G\, U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} = \frac{1}{A(z)}$$
128. 3.3.2 LPC Analysis Equations
$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n)$$
The linear predictor:
$$\tilde{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$$
The prediction error:
$$e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k)$$
Error transfer function:
$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k}$$
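The prediction error can be obtained by inverse filtering $s(n)$ through $A(z)$. The sketch below synthesizes an AR(2) signal and recovers its excitation; the coefficient values are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([1.3, -0.6])           # assumed predictor coefficients a_1, a_2
A = np.concatenate(([1.0], -a))     # A(z) = 1 - a_1 z^{-1} - a_2 z^{-2}

# Synthesize s(n) by passing unit-variance noise through 1/A(z) ...
s = lfilter([1.0], A, np.random.randn(5000))
# ... then inverse-filter with A(z) to recover the prediction error e(n)
e = lfilter(A, [1.0], s)
print(np.var(e))                    # approximately 1.0: the excitation variance
```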
129. 3.3 Linear Predictive Coding Model for Speech Recognition
[Figure: LPC synthesis model, in which the excitation u(n), scaled by the gain G, drives the all-pole filter 1/A(z) to produce the speech signal s(n)]
130. Linear Prediction Model (using LP analysis):
[Figure: discrete-time speech production model. A DT impulse generator (voiced, controlled by the pitch) and a white-noise generator (unvoiced) feed a V/U switch; the selected excitation, scaled by a gain estimate, drives a time-varying digital filter whose coefficients are the vocal tract parameters, producing the speech signal s(n)]
131. The basic problem of linear prediction analysis is to determine the set of predictor coefficients $a_k$.
Spectral characteristics of speech vary over time, so the predictor coefficients at a given time $n$ must be estimated from a short segment of the speech signal.
Short-time spectral analysis is performed on successive frames of speech, with frame spacing on the order of 10 ms.
132. 3.3.2 LPC Analysis Equations
Define the speech segment and error at time $n$:
$$s_n(m) = s(n+m), \qquad e_n(m) = e(n+m)$$
We seek to minimize the mean squared error signal:
$$E_n = \sum_m e_n^2(m) = \sum_m \Big[ s_n(m) - \sum_{k=1}^{p} a_k\, s_n(m-k) \Big]^2$$
134. 3.3.2 LPC Analysis Equations
The minimum mean-squared error can be expressed as:
$$\hat{E}_n = \sum_m s_n^2(m) - \sum_{k=1}^{p} \hat{a}_k \sum_m s_n(m)\, s_n(m-k) = \phi_n(0,0) - \sum_{k=1}^{p} \hat{a}_k\, \phi_n(0,k)$$
135. 3.3.3 The Autocorrelation Method
Window the segment:
$$s_n(m) = \begin{cases} s(n+m)\, w(m), & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
where $w(m)$ is a window that is zero outside $0 \le m \le N-1$.
The mean squared error is:
$$E_n = \sum_{m=0}^{N+p-1} e_n^2(m)$$
And:
$$\phi_n(i,k) = \sum_{m=0}^{N+p-1} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p, \; 0 \le k \le p$$
$$\phi_n(i,k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k), \qquad 1 \le i \le p, \; 0 \le k \le p$$
136. 3.3.3 The Autocorrelation Method
Since $\phi_n(i,k)$ is only a function of $i-k$, the covariance function reduces to the simple autocorrelation function:
$$\phi_n(i,k) = r_n(i-k)$$
137. 3.3.3 The Autocorrelation Method
Since the autocorrelation function is symmetric, i.e. $r_n(-k) = r_n(k)$, the LPC equations become
$$\sum_{k=1}^{p} \hat{a}_k\, r_n(|i-k|) = r_n(i), \qquad 1 \le i \le p$$
and can be expressed in matrix form as:
$$\begin{bmatrix} r_n(0) & r_n(1) & r_n(2) & \cdots & r_n(p-1) \\ r_n(1) & r_n(0) & r_n(1) & \cdots & r_n(p-2) \\ r_n(2) & r_n(1) & r_n(0) & \cdots & r_n(p-3) \\ \vdots & \vdots & \vdots & & \vdots \\ r_n(p-1) & r_n(p-2) & r_n(p-3) & \cdots & r_n(0) \end{bmatrix} \begin{bmatrix} \hat{a}_1 \\ \hat{a}_2 \\ \hat{a}_3 \\ \vdots \\ \hat{a}_p \end{bmatrix} = \begin{bmatrix} r_n(1) \\ r_n(2) \\ r_n(3) \\ \vdots \\ r_n(p) \end{bmatrix}$$
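Because the matrix is Toeplitz, the system can be solved efficiently, e.g. by the Levinson-Durbin recursion. Below is a minimal sketch using scipy's Toeplitz solver on a synthetic frame; the window choice, order p and test signal are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorrelation(frame: np.ndarray, p: int) -> np.ndarray:
    """Solve sum_k a_k r(|i-k|) = r(i), 1 <= i <= p (autocorrelation method)."""
    frame = frame * np.hamming(len(frame))            # w(m): analysis window
    r = np.correlate(frame, frame, mode="full")
    r = r[len(frame) - 1:len(frame) + p]              # r(0) ... r(p)
    # solve_toeplitz exploits the Toeplitz structure (Levinson recursion)
    return solve_toeplitz((r[:p], r[:p]), r[1:p + 1])

fs = 8000
n = np.arange(240)                                    # one 30 ms frame
frame = np.sin(2 * np.pi * 300 * n / fs) + 0.01 * np.random.randn(240)
print(lpc_autocorrelation(frame, p=4))                # predictor coefficients
```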
141. 3.3.4 The Covariance Method
Change the interval for computing the error to $0 \le m \le N-1$ and use the unweighted speech directly:
$$E_n = \sum_{m=0}^{N-1} e_n^2(m)$$
with $\phi_n(i,k)$ defined as:
$$\phi_n(i,k) = \sum_{m=0}^{N-1} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p, \; 0 \le k \le p$$
or, by a change of variables:
$$\phi_n(i,k) = \sum_{m=-i}^{N-i-1} s_n(m)\, s_n(m+i-k)$$
142. 3.3.4 The Covariance Method
In matrix form:
$$\begin{bmatrix} \phi_n(1,1) & \phi_n(1,2) & \phi_n(1,3) & \cdots & \phi_n(1,p) \\ \phi_n(2,1) & \phi_n(2,2) & \phi_n(2,3) & \cdots & \phi_n(2,p) \\ \phi_n(3,1) & \phi_n(3,2) & \phi_n(3,3) & \cdots & \phi_n(3,p) \\ \vdots & \vdots & \vdots & & \vdots \\ \phi_n(p,1) & \phi_n(p,2) & \phi_n(p,3) & \cdots & \phi_n(p,p) \end{bmatrix} \begin{bmatrix} \hat{a}_1 \\ \hat{a}_2 \\ \hat{a}_3 \\ \vdots \\ \hat{a}_p \end{bmatrix} = \begin{bmatrix} \phi_n(1,0) \\ \phi_n(2,0) \\ \phi_n(3,0) \\ \vdots \\ \phi_n(p,0) \end{bmatrix}$$
The resulting covariance matrix is symmetric, but not Toeplitz, and can be solved efficiently by a set of techniques called Cholesky decomposition.
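A minimal sketch of the covariance method with a Cholesky solve; it sums φ(i,k) over the interior of the frame so that no samples outside the buffer are needed, and the order p and test signal are assumptions.

```python
import numpy as np

def lpc_covariance(s: np.ndarray, p: int) -> np.ndarray:
    """Build phi(i,k) over the unweighted frame and solve the symmetric
    (non-Toeplitz) normal equations via Cholesky factorization."""
    N = len(s)
    phi = np.empty((p + 1, p + 1))
    for i in range(p + 1):
        for k in range(p + 1):
            # phi(i,k) = sum_m s(m-i) s(m-k), summed over m = p .. N-1
            phi[i, k] = np.dot(s[p - i:N - i], s[p - k:N - k])
    L = np.linalg.cholesky(phi[1:, 1:])    # symmetric positive-definite matrix
    y = np.linalg.solve(L, phi[1:, 0])     # forward substitution
    return np.linalg.solve(L.T, y)         # back substitution

fs = 8000
n = np.arange(240)
frame = np.sin(2 * np.pi * 300 * n / fs) + 0.01 * np.random.randn(240)
print(lpc_covariance(frame, p=4))
```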
143. 3.3.6 Examples of LPC Analysis
[Figure: examples of LPC analysis]
144. REFERENCES
TEXTBOOKS:
1. Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", Pearson Education, 2003.
2. Daniel Jurafsky and James H. Martin, "Speech and Language Processing – An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition", Pearson Education, 2002.
3. Frederick Jelinek, "Statistical Methods of Speech Recognition", MIT Press, 1997.
REFERENCES:
1. Steven W. Smith, "The Scientist and Engineer's Guide to Digital Signal Processing", California Technical Publishing, 1997.
2. Thomas F. Quatieri, "Discrete-Time Speech Signal Processing – Principles and Practice", Pearson Education, 2004.
3. Claudio Becchetti and Lucio Prina Ricotti, "Speech Recognition", John Wiley and Sons, 1999.
4. Ben Gold and Nelson Morgan, "Speech and Audio Signal Processing: Processing and Perception of Speech and Music", Wiley-India Edition, 2006.
Editor's Notes
On board: presentation of the source-filter model.
Here the bandwidth of the filters is not the same; it keeps increasing logarithmically. For uniform filters, the bandwidth that each individual filter spans is the same, hence the name uniform.
Recursion: an expression such that each term is generated by repeating a particular mathematical operation.
Time-frequency analysis plays a central role in signal analysis. It has long been recognized that a global Fourier transform of a long time signal is of little practical value for analyzing the frequency spectrum of the signal. Transient signals that evolve in time in an unpredictable way (like a speech signal or an EEG signal) necessitate a notion of frequency analysis that is local in time.
In many applications, such as speech processing, we are interested in the frequency content of a signal locally in time; that is, the signal parameters (frequency content, etc.) evolve over time. Such signals are called non-stationary. For a non-stationary signal, x(t), the standard Fourier transform is not useful for analyzing the signal. Information that is localized in time, such as spikes and high-frequency bursts, cannot be easily detected from the Fourier transform.
Time-localization can be achieved by first windowing the signal so as to cut out only a well-localized slice of x(t) and then taking its Fourier transform. This gives rise to the short-time Fourier transform (STFT), or windowed Fourier transform. The magnitude of the STFT is called the spectrogram. By restricting to a discrete range of frequencies and times we can obtain an orthogonal basis of functions.