A Case Study on DSP Speech Processing
What is Speech Processing?
Speech processing is the application of digital signal processing (DSP) techniques to the processing and analysis of speech signals.
Applications of speech processing include:
1. Speech Coding 2. Speech Recognition 3. Speech Verification
4. Speech Enhancement 5. Speech Synthesis
Process of Speech Production
The speech production process begins when the talker formulates a message in his/her mind to transmit to a listener via speech. The next step is the conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences corresponding to the sounds that make up the words, along with prosody markers denoting the duration, loudness, and pitch associated with the sounds.
Figure: Schematic diagram of the speech production/perception process in human beings
Information Rate of the Speech Signal
First Stage
The discrete-symbol information rate in the raw message text is rather low: about 50 bits per second, corresponding to about 8 sounds per second, where each sound is one of about 50 distinct symbols. After the language-code conversion, with the inclusion of prosody information, the information rate rises to about 200 bps.
Second Stage
In the next stage the representation of the information in the signal becomes continuous, with an equivalent rate of about 2,000 bps at the neuromuscular control level and about 30,000-50,000 bps at the acoustic signal level.
Third Stage
The continuous information rate at the basilar membrane is in the range of 30,000-50,000 bps, while at the neural transduction stage it is about 2,000 bps.
Fourth Stage
The higher-level processing within the brain converts the neural signals to a discrete representation, which is ultimately decoded into a low-bit-rate message.
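The first-stage figure can be checked with a quick back-of-the-envelope calculation, using the values quoted above (about 8 sounds per second, each one of about 50 distinct symbols):

```python
import math

# First-stage estimate from the text: ~8 sounds per second,
# each drawn from ~50 distinct symbols.
sounds_per_second = 8
n_symbols = 50

bits_per_symbol = math.log2(n_symbols)            # about 5.6 bits per sound
text_rate_bps = sounds_per_second * bits_per_symbol

print(f"raw text rate: {text_rate_bps:.0f} bps")  # ~45 bps, i.e. "about 50 bps"
```

This is why the slide says "about" 50 bps: 8 × log2(50) is roughly 45 bits per second.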
Classification of Speech Sound
Type 1: VOICED speech is produced when the vocal cords play an active role in the production of sound. Typical pitch ranges:
• 50-200 Hz for male speakers
• 150-300 Hz for female speakers
• 200-400 Hz for child speakers
Example: voiced sounds (A), (E), (I).
Type 2: UNVOICED speech is produced when the vocal cords are inactive: they are held open and air flows continuously through them.
Example: unvoiced sounds (S), (F).
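A classic way to make this voiced/unvoiced distinction in DSP is to compare short-time energy and zero-crossing rate per frame. A minimal sketch on synthetic signals (the 120 Hz tone, the thresholds, and the frame length are illustrative assumptions, not values from the slides):

```python
import numpy as np

def classify_frame(frame):
    """Crude voiced/unvoiced decision: voiced frames carry high energy
    and few zero crossings; unvoiced frames are noise-like with many."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # crossings per sample
    return "voiced" if zcr < 0.1 and energy > 1e-4 else "unvoiced"

fs = 8000
t = np.arange(0, 0.02, 1 / fs)                  # one 20 ms frame
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)      # 120 Hz tone, a male-range pitch
rng = np.random.default_rng(0)
unvoiced = 0.1 * rng.standard_normal(len(t))    # white noise, like /s/ or /f/

print(classify_frame(voiced))    # voiced
print(classify_frame(unvoiced))  # unvoiced
```

A periodic 120 Hz tone crosses zero only ~240 times per second, while white noise crosses roughly every other sample, which is what the zero-crossing threshold exploits.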
Formant Frequencies
Speech normally exhibits one formant frequency in every 1 kHz. For VOICED speech, the magnitudes of the lower formant frequencies are successively larger than the magnitudes of the higher formant frequencies. For UNVOICED speech, the magnitudes of the higher formant frequencies are successively larger than the magnitudes of the lower formant frequencies.
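A common way to estimate formants in practice is linear predictive coding (LPC): the roots of the prediction polynomial sit near the vocal-tract resonances. A minimal sketch on a synthetic vowel-like frame (the 700 Hz and 1200 Hz resonances, the model order 8, and the frame length are illustrative assumptions):

```python
import numpy as np

def lpc_formants(x, fs, order=8):
    """Estimate formants as the angles of the roots of an LPC polynomial
    fitted by the autocorrelation (normal-equations) method."""
    x = x * np.hamming(len(x))
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])      # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]           # keep one of each conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))

fs = 8000
t = np.arange(0, 0.03, 1 / fs)
rng = np.random.default_rng(1)
# Synthetic vowel-like frame with resonances near 700 Hz and 1200 Hz.
x = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
     + 0.01 * rng.standard_normal(len(t)))
freqs = lpc_formants(x, fs)
print(np.round(freqs))  # two of the estimates land near 700 and 1200
```

With a model order of 8 over a 4 kHz bandwidth, the "one formant per kHz" rule of thumb above suggests roughly four resonances, which is why order 2 × (formants) is the usual choice.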
Basic Assumption of Speech Processing
Parameters & Speech Sound
1. Phonemes are the smallest segments of speech sound that distinguish one word from another: /d/ and /b/ are distinct phonemes, e.g. dark vs. bark.
2. It is important to realize that phonemes are abstract linguistic units and may not be directly observed in the speech signal.
3. Different speakers producing the same string of phonemes convey the same information, yet sound different as a result of differences in dialect and in vocal tract length and shape.
4. There are about 40 phonemes in English.
5. A table of IPA (International Phonetic Alphabet) symbols lists each phoneme together with sample words in which it occurs.
Model for Speech Production
To develop an accurate model of how speech is produced, it is necessary to build a digital filter-based model of the human speech production mechanism. The model must cover:
• Steps of speech production
• Operation of the vocal tract
• Lip/nasal radiation process
• Both voiced and unvoiced speech
• Time frame: 10-20 ms
Overall Speech Production Model
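The elements above fit the classic source-filter picture: an excitation (impulse train for voiced, noise for unvoiced speech) drives a vocal-tract filter, followed by a lip-radiation stage, computed frame by frame. A minimal sketch (the 120 Hz pitch, the 700 Hz and 2500 Hz resonances, and the pole radius are illustrative assumptions, not values from the slides):

```python
import numpy as np

fs = 8000
n = int(0.02 * fs)                       # one 20 ms frame (the 10-20 ms range above)

# Excitation source: impulse train for voiced speech, noise for unvoiced.
f0 = 120                                 # assumed pitch
voiced_src = np.zeros(n)
voiced_src[::fs // f0] = 1.0
rng = np.random.default_rng(0)
unvoiced_src = rng.standard_normal(n)

def vocal_tract(x, formant_hz, r=0.97):
    """One vocal-tract resonance as a two-pole IIR filter:
    y[k] = x[k] + 2 r cos(w0) y[k-1] - r^2 y[k-2]."""
    w0 = 2 * np.pi * formant_hz / fs
    y = np.zeros_like(x)
    for k in range(len(x)):
        y[k] = (x[k] + 2 * r * np.cos(w0) * (y[k - 1] if k >= 1 else 0.0)
                - r * r * (y[k - 2] if k >= 2 else 0.0))
    return y

voiced = vocal_tract(voiced_src, 700)        # vowel-like output
unvoiced = vocal_tract(unvoiced_src, 2500)   # fricative-like output

# Lip radiation is commonly approximated by a first difference.
speech = np.diff(voiced)
print(len(speech), "samples of synthetic voiced speech")
```

A full model would cascade several such resonators (one per formant) and update pitch, voicing, and filter coefficients every 10-20 ms frame.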
Thank You
