Speech processing

Research Issues in Speech Processing

Dr. M. Sabarimalai Manikandan
msm.sabari@gmail.com

Speech Production: the source-filter model
The speech signal conveys the information contained in the spoken word
It is a highly non-stationary signal
Short segments of speech (20 to 30 ms) can be treated as quasi-stationary
Most of the acoustical energy is in the frequency range of 100-6000 Hz

Vocal tract transfer function can be modeled by an all-pole filter
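
As a minimal sketch of the all-pole idea, the filter H(z) = G / (1 - sum_k a_k z^-k) can be run as the recursion y[n] = G x[n] + sum_k a_k y[n-k]. The function name, the coefficient values and the impulse excitation below are illustrative assumptions, not values from the slides.

```python
def all_pole_filter(excitation, a, gain=1.0):
    """Filter an excitation through H(z) = gain / (1 - sum_k a[k-1] z^-k)."""
    out = []
    for n, x in enumerate(excitation):
        y = gain * x
        # Feedback terms: each coefficient weights a past output sample.
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                y += coeff * out[n - k]
        out.append(y)
    return out

# Hypothetical one-pole example: impulse in, decaying response out.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = all_pole_filter(impulse, a=[0.5])
print(response)
```

In a source-filter view the excitation would be a pulse train (voiced) or noise (unvoiced), and the coefficients would come from linear prediction analysis of a speech frame.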

Speech Processing Tasks

Speech recognition (recognizing lexical content)
Speech synthesis (text-to-speech)
Speaker recognition (recognizing who is speaking)
Speech understanding and vocal dialog
Speech coding (data rate reduction)
Speech enhancement (Noise reduction)
Speech transmission (noise-free communication)
Voice conversion

Speech Processing
Speech measurements
Short-time energy (STE)
Zero crossing rate (ZCR)
Autocorrelation (AC)
Pitch period or frequency
Formants
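
The first measurements in this list can be sketched on a single frame as follows. This is a simplified illustration: the function names, the 8 kHz sampling rate, the 50-400 Hz search range and the synthetic 100 Hz test tone are all assumptions, and practical pitch trackers add normalization and a voicing decision.

```python
import math

def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def autocorrelation(frame, lag):
    """Unnormalized short-time autocorrelation at a given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def pitch_from_autocorrelation(frame, fs, f_min=50.0, f_max=400.0):
    """Pick the lag with the largest autocorrelation inside the search range."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return fs / best_lag

fs = 8000                                   # assumed sampling rate in Hz
frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(240)]  # 30 ms
energy = short_time_energy(frame)
zcr = zero_crossing_rate(frame)
pitch_hz = pitch_from_autocorrelation(frame, fs)
print(energy, zcr, pitch_hz)
```

High energy with a low zero crossing rate is typical of voiced speech, while low energy with a high zero crossing rate suggests unvoiced speech or silence.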

Speech signal components
Speech-Silence or Non-speech
Voiced speech-Unvoiced speech

Speech Processing
Speech representations or models
Temporal features
• Low energy rate
• Zero crossing rate (ZCR)
• 4Hz modulation energy
• Pitch contour

Spectral features
• Spectral Centroid (sharpness)
• Spectral Flux (rate of change)
• Spectral Roll-Off (spectral shape)
• Spectral Flatness (deviation of the spectral form)
Linear Predictive Coefficients (LPC)
Cepstral coefficients
Mel Frequency Cepstral Coefficients (MFCC): human auditory system
Harmonic features: sinusoidal harmonic modelling
Perceptual features: model of the human hearing process
First order derivative (DELTA)
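
The spectral-shape descriptors listed above can be sketched from a magnitude spectrum. Definitions vary between toolkits (magnitude versus power, the roll-off fraction), so the choices here, including the 85% roll-off point, the naive DFT and the 1 kHz test tone, are assumptions for illustration only.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short demo frames)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_centroid(mag, fs, n_fft):
    """Magnitude-weighted mean frequency."""
    return sum(k * fs / n_fft * m for k, m in enumerate(mag)) / sum(mag)

def spectral_rolloff(mag, fs, n_fft, fraction=0.85):
    """Lowest frequency below which `fraction` of the total magnitude lies."""
    target = fraction * sum(mag)
    running = 0.0
    for k, m in enumerate(mag):
        running += m
        if running >= target:
            return k * fs / n_fft
    return (len(mag) - 1) * fs / n_fft

def spectral_flatness(mag, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    power = [m * m + eps for m in mag]
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    return geometric / (sum(power) / len(power))

def spectral_flux(mag_prev, mag_cur):
    """Squared change between the spectra of consecutive frames."""
    return sum((c - p) ** 2 for p, c in zip(mag_prev, mag_cur))

fs, n_fft = 8000, 64
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(n_fft)]
mag = magnitude_spectrum(tone)
centroid = spectral_centroid(mag, fs, n_fft)
rolloff = spectral_rolloff(mag, fs, n_fft)
flatness = spectral_flatness(mag)
flux = spectral_flux(mag, mag)
print(centroid, rolloff, flatness, flux)
```

A pure tone gives a flatness near 0, whereas a noise-like (flat) spectrum gives a value near 1.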

Elements of the speech signal
Phonemes: the smallest units of speech sounds
Vowels and Consonants
~12 to 21 different vowel sounds used in the English language

Consonants involve rapid and sometimes subtle changes in sound
according to the manner of articulation:
• plosive (p, b, t, etc.)
• fricative (f, s, sh, etc.)
• nasal (m, n, ng)
• liquid (r, l) and
• semivowel (w, y)

Consonants are more independent of language than vowels are.

Syllable: one or more phonemes

Word: one or more syllables

Automatic Speech Recognition
There are two uses for speech recognition systems:

Dictation: translation of the spoken word into written text
Computer Control: control of the computer, and software
applications by speaking commands

Speaker-dependent system: operates for a single speaker
Speaker-independent system: operates for any speaker
of a particular type
Speaker-adaptive system: adapts its operation to the
characteristics of new speakers

The size of vocabulary affects the complexity, processing
requirements and the accuracy of the system

Speech Recognition: Applications

Automatic translation
Vehicle navigation systems
Human-computer interaction
Content-based spoken audio search
Home automation
Pronunciation evaluation
Robotics
Video games
Transcription of speech into mobile text messages
Assistive technology for people with disabilities

Speech Recognition System

Sampling of speech

Acoustic signal processing:
• Linear Prediction Cepstral Coefficients (LPCC)
• Mel Frequency Cepstral Coefficients (MFCC)
• Perceptual Linear Prediction Cepstral Coefficients (PLPCC)

Recognition of phonemes, groups of phonemes and words:
• Dynamic Time Warping (DTW)
• hidden Markov models (HMMs)
• Gaussian mixture models (GMMs)
• Neural Networks (NNs)
• Expert systems and combinations of techniques
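
Of the recognition techniques listed, dynamic time warping is the simplest to sketch: it aligns two sequences of different lengths and accumulates the local distance along the best warping path. The version below works on one-dimensional toy sequences with an absolute-difference cost; both choices are assumptions, since real systems compare feature vectors such as MFCCs.

```python
def dtw_distance(a, b):
    """Classic DTW with steps (i-1, j), (i, j-1), (i-1, j-1)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

template = [1, 2, 3, 2, 1]
slow_utterance = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]   # same shape, spoken slower
other_word = [3, 2, 1, 2, 3]
same = dtw_distance(template, slow_utterance)
different = dtw_distance(template, other_word)
print(same, different)
```

The slowed-down copy of the template aligns at zero cost, which is the property that makes DTW useful for isolated-word matching when the speaking rate varies.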

Automatic Speaker Recognition
Speaker recognition: the process of automatically recognizing who is
speaking by using the speaker-specific information included in speech
sounds

Speaker identity: physiological and behavioral characteristics of the speech
production model of an individual speaker
the spectral envelope (vocal tract characteristics)
the supra-segmental features (voice source characteristics) of
speech

Applications:
• banking over a telephone network
• telephone shopping and database access services
• voice dialing and mail
• information and reservation services
• security control for confidential information
• forensics and surveillance applications

Speaker Recognition
Speaker identification: the process of determining which registered speaker
produced the input speech

[Figure: speaker identification. Input speech → feature extraction → similarity against the reference template or model of each registered speaker (#1, #2, ..., #N) → maximum selection → identification result (speaker ID)]
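
The identification scheme above (extract features, score them against every reference model, select the maximum) can be sketched as follows. Cosine similarity, the three-dimensional feature vectors and the speaker names are hypothetical stand-ins for whatever features and models a real system uses.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def identify_speaker(features, reference_models):
    """Score the input against every registered model and keep the maximum."""
    scores = {spk: cosine_similarity(features, ref)
              for spk, ref in reference_models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical reference templates for three registered speakers.
reference_models = {
    "speaker_1": [1.0, 0.0, 0.5],
    "speaker_2": [0.0, 1.0, 0.5],
    "speaker_3": [0.5, 0.5, 1.0],
}
input_features = [0.9, 0.1, 0.4]
speaker_id, scores = identify_speaker(input_features, reference_models)
print(speaker_id)
```

Because the maximum is always taken over the registered set, this is closed-set identification; an open-set system would add a rejection threshold.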

Speaker Recognition
Speaker verification: the process of accepting or rejecting the
identity claim of a speaker.
[Figure: speaker verification. Input speech → feature extraction → similarity against the reference template or model of the claimed speaker (#M) → decision against a threshold → verification result (accept/reject)]
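
Verification differs from identification in that only the claimed speaker's model is scored and the result is compared with a threshold. In this sketch the similarity measure (an inverse Euclidean distance), the feature vectors and the threshold of 0.7 are illustrative assumptions.

```python
import math

def similarity(u, v):
    """Map Euclidean distance to a score in (0, 1]; 1 means identical."""
    return 1.0 / (1.0 + math.dist(u, v))

def verify_speaker(features, claimed_model, threshold):
    """Accept the identity claim only if the score reaches the threshold."""
    score = similarity(features, claimed_model)
    return score >= threshold, score

claimed_model = [1.0, 0.0, 0.5]        # hypothetical model of speaker #M
genuine_input = [0.9, 0.1, 0.4]
impostor_input = [0.0, 1.0, 0.5]

accept_genuine, genuine_score = verify_speaker(genuine_input, claimed_model, 0.7)
accept_impostor, impostor_score = verify_speaker(impostor_input, claimed_model, 0.7)
print(accept_genuine, accept_impostor)
```

Raising the threshold reduces false acceptances at the cost of more false rejections, so it is usually tuned on held-out data.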

Open Set and Closed Set Recognition

Text-dependent and Text-independent Recognition
• Vector quantization
• Gaussian mixture models (GMM)
• Dynamic time warping (DTW)
• Hidden Markov model (HMM)
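
As one example from this list, vector quantization can be used for text-independent recognition by giving each speaker a codebook and scoring an utterance by its average quantization distortion. The two-dimensional codewords and frames below are made-up values, and the codebook training step (for example k-means) is omitted.

```python
import math

def vq_distortion(frames, codebook):
    """Average distance from each feature frame to its nearest codeword."""
    return sum(min(math.dist(f, c) for c in codebook) for f in frames) / len(frames)

# Hypothetical, already-trained codebooks for two speakers.
codebooks = {
    "speaker_A": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_B": [(5.0, 5.0), (6.0, 4.0)],
}
test_frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]

distortions = {spk: vq_distortion(test_frames, cb) for spk, cb in codebooks.items()}
best_match = min(distortions, key=distortions.get)
print(best_match, distortions)
```

The frame order does not affect the score, which is why this approach suits text-independent tasks, whereas DTW and HMMs model temporal order and fit text-dependent ones.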

Text-to-Speech (TTS) System
Synthesis of speech for effective human-machine communication
reading email messages
call center help desks and customer care
announcement machines

[Figure: TTS pipeline. Raw or tagged text → text analysis (document structure detection, text normalization, linguistic analysis) → phonetic analysis (homograph disambiguation, grapheme-to-phoneme conversion) → prosodic analysis (pitch, duration) → speech synthesis (voice rendering) → synthetic speech]

Synthetic speech should be intelligible and natural

Speech Synthesis

Text-to-speech (TTS) synthesis systems
Approach
TTS system performance measure
• Synthetic Speech Intelligibility
• Synthetic speech naturalness

Speech Intelligibility Tests
Segmental level analysis
• the Rhyme Test
• the Modified Rhyme Test
• the Diagnostic Rhyme Test
Supra-segmental analysis
• the Harvard Psychoacoustic Sentences (HPS)
• the Haskins syntactic sentences

Speech Coding (Compression)
Speech Coding for efficient transmission and storage of speech
narrowband and broadband wired telephony
cellular communications
Voice over IP (VoIP) to utilize the Internet
Telephone answering machines
IVR systems
Prerecorded messages

Speech-Assisted Translation Corrector System

Objective: Develop a speech-assisted translation corrector (SATC)
system that produces a grammatically correct sentence from a
sentence output by machine translation
[Figure: SATC system. Input sentence → multilingual machine translation system → translated sentence with grammatical errors → speech-assisted translation corrector (text translator, speech storage) → grammatically correct sentence. Example sentence: "He came here"]
The speech signal is produced from the words in the translated sentence.

“A MT system is correct and complete if it can analyze all of the grammatical structures
encountered in the source language, and it can generate all of the grammatical structures
necessary in the target language translation.”
8/25/2011 16

SATC System: Requirements and Challenging Tasks

The creation of large-scale, rich multilingual speech databases is a crucial
task for research and development in language and speech technology

Indian languages
speakers (10 Males and 10 Females)
age groups ( <20, 15-40, >40)
audio format: 16-bit stereo, and sampling rate of 44.1 kHz
annotation and assessment of speech databases
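
To give a sense of scale for the recording format listed above, the raw PCM data rate follows directly from sampling rate × bits per sample × channels. The helper name below is hypothetical, and the figure ignores any container or header overhead.

```python
def pcm_data_rate(sample_rate_hz, bits_per_sample, channels):
    """Return the uncompressed PCM rate as (bits per second, bytes per second)."""
    bits_per_second = sample_rate_hz * bits_per_sample * channels
    return bits_per_second, bits_per_second // 8

# 16-bit stereo at 44.1 kHz, as specified for the database.
bits_per_s, bytes_per_s = pcm_data_rate(44_100, 16, 2)
bytes_per_hour = bytes_per_s * 3600
print(bits_per_s, bytes_per_s, bytes_per_hour)
```

This works out to 1,411,200 bit/s, or about 635 MB per hour of recording per speaker, which matters when planning storage for a multi-speaker, multilingual corpus.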

Development of multilingual text-to-speech interface

Development of spoken word matching module

Development of speech signal processing (SSP) tools


Major Problems in Speech Processing
Acoustic variability: the same phoneme pronounced in
different contexts has different acoustic realizations
(coarticulation effect)

The signal is different when speech is uttered in various
environments:
noise
reverberation
different types of microphones.

Speaking variability: when the same speaker speaks normally,
shouts, whispers, uses a creaky voice, or has a cold

Speaker variability: different speakers have different
timbres and different speaking habits

Linguistic variability: the same sentence can be pronounced
in many different ways, using many different words,
synonyms, and many different syntactic structures and
prosodic schemes

Phonetic variability: due to the different possible
pronunciations of the same words by speakers having
different regional accents

Lombard effect: noise modifies the utterance of the words (as
people tend to speak louder)

Continuous speech:
words are connected together (not separated by pauses or
silences).

It is difficult to find the start and end points of words

The production of each phoneme is affected by the
production of surrounding phonemes

The start and end of words are affected by the preceding
and following words

The rate of speech also matters (fast speech tends to be harder to recognize)

