Speech processing
1. Research Issues in Speech Processing
Dr. M. Sabarimalai Manikandan
msm.sabari@gmail.com
2. Speech Production: the source-filter model
Speech signal conveys the information contained in the spoken word
highly non-stationary signal
Short segments of speech (20 to 30 ms) can be treated as approximately stationary
Most of the acoustical energy lies in the frequency range of 100-6000 Hz
Vocal tract transfer function can be modeled by an all-pole filter
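A minimal sketch of the source-filter idea, assuming an idealized impulse-train excitation and a single formant modeled as a two-pole resonator. The function names, the 500 Hz formant and the 100 Hz bandwidth are illustrative assumptions, not values from the slides.

```python
import math

def impulse_train(n_samples, period):
    """Idealized voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def all_pole_filter(x, a):
    """Apply H(z) = 1 / (1 + a[0] z^-1 + ... + a[p-1] z^-p) to x."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Denominator coefficients of a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius, < 1 so stable
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle
    return [-2.0 * r * math.cos(theta), r * r]

fs = 8000
source = impulse_train(400, period=80)    # 100 Hz pitch at 8 kHz sampling
a = resonator_coeffs(500.0, 100.0, fs)    # assumed formant, for illustration
speech_like = all_pole_filter(source, a)  # filter (vocal tract) applied to source
```

A realistic vocal tract model would cascade several such resonances (typically an all-pole filter of order 10 or more), but the structure stays the same: excitation in, all-pole filter, speech-like signal out.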
4. Speech Processing
Speech measurements
Short-time energy (STE)
Zero crossing rate (ZCR)
Autocorrelation (AC)
Pitch period or frequency
Formants
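As an illustration of how the autocorrelation measurement relates to pitch, the sketch below picks the lag with the largest autocorrelation inside an assumed pitch search range (60 to 400 Hz). The function names and the search range are assumptions made for this example.

```python
import math

def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def estimate_pitch_hz(frame, fs, f_min=60.0, f_max=400.0):
    """Pitch estimate: the lag with the strongest autocorrelation peak."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    r = autocorrelation(frame, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda k: r[k])
    return fs / best_lag

fs = 8000
# 30 ms frame of a 200 Hz sinusoid standing in for voiced speech
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(240)]
pitch = estimate_pitch_hz(frame, fs)
```

On real speech the peak is less clean, so practical pitch trackers usually add a voicing check and some smoothing across frames.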
Speech signal components
Speech-Silence or Non-speech
Voiced speech-Unvoiced speech
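The two simplest measurements above, short-time energy and zero crossing rate, are often combined to separate silence, voiced and unvoiced frames. The following is a rough sketch; the thresholds and the synthetic test frames are assumptions chosen for illustration and would need tuning on real recordings.

```python
import math
import random

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def classify_frame(frame, energy_floor=1e-4, zcr_split=0.25):
    """Illustrative rule: low energy -> silence, high ZCR -> unvoiced."""
    if short_time_energy(frame) < energy_floor:
        return "silence"
    return "unvoiced" if zero_crossing_rate(frame) > zcr_split else "voiced"

fs, n = 8000, 240                     # 30 ms frames at 8 kHz
rng = random.Random(0)
voiced_like = [0.5 * math.sin(2 * math.pi * 150 * i / fs) for i in range(n)]
unvoiced_like = [0.1 * rng.uniform(-1, 1) for _ in range(n)]   # noise-like
silence_like = [0.0] * n
labels = [classify_frame(f) for f in (voiced_like, unvoiced_like, silence_like)]
```

Voiced speech tends to have high energy and few zero crossings, while unvoiced speech is noise-like with many zero crossings, which is what the rule exploits.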
5. Speech Processing
Speech representations or models
Temporal features
• Low energy rate
• Zero crossing rate (ZCR)
• 4Hz modulation energy
• Pitch contour
Spectral features
• Spectral Centroid (sharpness)
• Spectral Flux (rate of change)
• Spectral Roll-Off (spectral shape)
• Spectral Flatness (deviation of the spectral form)
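The four spectral features can be sketched as follows. Definitions vary between toolkits, so the formulas here (magnitude-weighted centroid, 85% roll-off, geometric over arithmetic mean of the power spectrum, Euclidean flux) are common choices assumed for this example; the helper names are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_centroid(mag, fs, n_fft):
    """Magnitude-weighted mean frequency in Hz."""
    total = sum(mag)
    return sum(k * fs / n_fft * m for k, m in enumerate(mag)) / total if total else 0.0

def spectral_rolloff(mag, fs, n_fft, fraction=0.85):
    """Frequency below which `fraction` of the total magnitude lies."""
    target = fraction * sum(mag)
    running = 0.0
    for k, m in enumerate(mag):
        running += m
        if running >= target:
            return k * fs / n_fft
    return (len(mag) - 1) * fs / n_fft

def spectral_flatness(mag, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (0..1)."""
    power = [m * m + eps for m in mag]
    geo = math.exp(sum(math.log(p) for p in power) / len(power))
    return geo / (sum(power) / len(power))

def spectral_flux(mag_prev, mag_cur):
    """Euclidean distance between consecutive magnitude spectra."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(mag_prev, mag_cur)))

fs, n_fft = 8000, 128
tone = [math.sin(2 * math.pi * 16 * t / n_fft) for t in range(n_fft)]  # 1000 Hz
mag = magnitude_spectrum(tone)
centroid = spectral_centroid(mag, fs, n_fft)
rolloff = spectral_rolloff(mag, fs, n_fft)
flatness = spectral_flatness(mag)
flux = spectral_flux(mag, mag)
```

For a pure tone the centroid and roll-off both sit at the tone frequency and the flatness is close to zero; noise-like frames give a flatness closer to one.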
Linear Predictive Coefficients (LPC)
Cepstral coefficients
Mel Frequency Cepstral Coefficients (MFCC): human auditory system
Harmonic features: sinusoidal harmonic modelling
Perceptual features: model of the human hearing process
First order derivative (DELTA)
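To make the linear predictive coefficients in the list above more concrete, here is a sketch of the autocorrelation method with the Levinson-Durbin recursion. The slides do not say which estimation method is intended, so this choice, and the function names, are assumptions.

```python
def autocorr(x, order):
    """Autocorrelation values r[0..order] of a frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Return (a[1..p], error) for A(z) = 1 + sum_k a[k] z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                    # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)              # remaining prediction error
    return a[1:], err

# Test signal: x[n] = 0.9 * x[n-1], so a first-order predictor fits exactly
frame = [0.9 ** n for n in range(200)]
coeffs, err = levinson_durbin(autocorr(frame, 2), 2)
```

With this sign convention the predicted sample is minus the weighted sum of past samples, so the example signal yields a first coefficient near -0.9 and a second coefficient near zero. The resulting A(z) is the denominator of the all-pole vocal tract model.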
6. Elements of the speech signal
Phonemes: the smallest units of speech sounds
Vowels and Consonants
~12 to 21 different vowel sounds used in the English language
Consonants involve rapid and sometimes subtle changes in sound
according to the manner of articulation:
• plosive (p, b, t, etc.)
• fricative (f, s, sh, etc.)
• nasal (m, n, ng)
• liquid (r, l) and
• semivowel (w, y)
Consonants are more independent of language than vowels are.
Syllable: one or more phonemes
Word: one or more syllables
7. Automatic Speech Recognition
There are two main uses for speech recognition systems:
Dictation: translation of the spoken word into written text
Computer Control: control of the computer, and software
applications by speaking commands
Speaker dependent system: to operate for a single speaker
Speaker independent system: to operate for any speaker
of a particular type
Speaker adaptive system: to adapt its operation to the
characteristics of new speakers
The size of vocabulary affects the complexity, processing
requirements and the accuracy of the system
8. Speech Recognition: Applications
Automatic translation
Vehicle navigation systems
Human computer Interaction
Content-based spoken audio search
Home automation
Pronunciation evaluation
Robotics
Video games
Transcription of speech into mobile text messages
People with disabilities
9. Speech Recognition System
Sampling of speech
Acoustic signal processing:
• Linear Prediction Cepstral Coefficients (LPCC)
• Mel Frequency Cepstral Coefficients (MFCC)
• Perceptual Linear Prediction Cepstral Coefficients (PLPCC)
Recognition of phonemes, groups of phonemes and words:
• Dynamic Time Warping (DTW)
• hidden Markov models (HMMs)
• Gaussian mixture models (GMMs)
• Neural Networks (NNs)
• Expert systems and combinations of techniques
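Of the recognition techniques listed, dynamic time warping is the easiest to show in a few lines. The sketch below aligns two sequences of scalar features; a real recognizer would compare vectors such as MFCC frames, and the toy sequences here are made up for illustration.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best time alignment between a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])             # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

template = [0, 1, 2, 3, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]    # same shape, half speed
other = [3, 2, 1, 0, 1, 2, 3]                        # different shape

same_word = dtw_distance(template, slow)
different_word = dtw_distance(template, other)
```

The slowed-down utterance aligns with the template at zero cost, while the differently shaped one does not, which is why DTW copes with variation in speaking rate.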
10. Automatic Speaker Recognition
Speaker recognition: the process of automatically recognizing who is
speaking by using the speaker-specific information included in speech
sounds
Speaker identity: physiological and behavioral characteristics of the speech
production model of an individual speaker
the spectral envelope (vocal tract characteristics)
the supra-segmental features (voice source characteristics) of
speech
Applications:
• banking over a telephone network
• telephone shopping and database access services
• voice dialing and mail
• information and reservation services
• security control for confidential information
• forensics and surveillance applications
11. Speaker Recognition
Speaker identification: the process of determining which registered speaker
provides input speech sounds
[Figure: block diagram of speaker identification. Input speech → Feature extraction → Similarity against the reference template or model of each registered speaker (#1, #2, ..., #N) → Maximum selection → Identification result (Speaker ID)]
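The identification scheme (score the input against every registered speaker, then select the maximum) can be sketched as below. Cosine similarity and the three-dimensional feature vectors are assumptions made for the example; practical systems score against VQ codebooks, GMMs or similar models.

```python
import math

def cosine_similarity(u, v):
    """Similarity in [-1, 1] between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def identify_speaker(features, reference_models):
    """Score the input against every registered speaker, pick the maximum."""
    scores = {sid: cosine_similarity(features, ref)
              for sid, ref in reference_models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical reference vectors, one per registered speaker
reference_models = {
    "speaker_1": [1.0, 0.2, 0.1],
    "speaker_2": [0.1, 1.0, 0.3],
    "speaker_3": [0.2, 0.1, 1.0],
}
speaker_id, scores = identify_speaker([0.2, 0.9, 0.4], reference_models)
```

This is the closed-set case: the input is always assigned to one of the N registered speakers, even if none of them is a good match.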
12. Speaker Recognition
Speaker verification: the process of accepting or rejecting the
identity claim of a speaker.
[Figure: block diagram of speaker verification. Input speech → Feature extraction → Similarity against the reference template or model of the claimed speaker (#M) → Decision against a threshold → Verification result (Accept/Reject)]
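Verification differs from identification in that only the claimed speaker's model is scored and the score is compared with a threshold. A minimal sketch, assuming cosine similarity and an arbitrary threshold of 0.8 (both are illustrative choices, not values from the slides):

```python
import math

def cosine_similarity(u, v):
    """Similarity in [-1, 1] between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def verify_speaker(features, claimed_model, threshold=0.8):
    """Accept the identity claim only if the similarity reaches the threshold."""
    score = cosine_similarity(features, claimed_model)
    return ("accept" if score >= threshold else "reject"), score

claimed_model = [0.1, 1.0, 0.3]          # hypothetical model of speaker #M
genuine, genuine_score = verify_speaker([0.2, 0.9, 0.4], claimed_model)
impostor, impostor_score = verify_speaker([1.0, 0.2, 0.1], claimed_model)
```

Raising the threshold lowers the false acceptance rate at the cost of more false rejections, so the value is normally set on held-out data for the target application.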
Open Set and Closed Set Recognition
Text-dependent and Text-independent Recognition
• Vector quantization
• Gaussian mixture models (GMM)
• Dynamic time warping (DTW)
• Hidden Markov model (HMM)
13. Text-to-Speech (TTS) System
Synthesis of Speech for effective human machine communications
reading email messages
call center help desks and customer care
announcement machines
[Figure: TTS block diagram. Raw or tagged text → Text Analysis → Phonetic Analysis → Prosodic Analysis → Speech Synthesis → Synthetic Speech]
Text Analysis: Document Structure Detection, Text Normalization, Linguistic Analysis
Phonetic Analysis: Homograph disambiguation, Grapheme-to-Phoneme Conversion
Prosodic Analysis: Pitch, Duration
Speech Synthesis: Voice Rendering
Synthetic speech should be intelligible and natural
14. Speech Synthesis
Text-to-speech (TTS) synthesis systems
Approach
TTS system performance measure
• Synthetic Speech Intelligibility
• Synthetic speech naturalness
Speech Intelligibility Tests
Segmental level analysis
• the Rhyme Test
• the Modified Rhyme Test
• the Diagnostic Rhyme Test
Supra-segmental analysis
• the Harvard Psychoacoustic Sentences (HPS)
• the Haskins syntactic sentences
15. Speech Coding (Compression)
Speech Coding for efficient transmission and storage of speech
narrowband and broadband wired telephony
cellular communications
Voice over IP (VoIP) to utilize the Internet
Telephone answering machines
IVR systems
Prerecorded messages
16. Speech-Assisted Translation Corrector System
Objective: Develop a speech-assisted translation corrector (SATC)
system which provides a grammatically correct sentence for a
translated sentence from the machine translation
[Figure: SATC block diagram. Input sentence → Multilingual machine translation system (Translator) → translated sentence with grammatical errors → Speech-assisted translation corrector → grammatically correct sentence (example: "He came here"). Other labels: text, speech storage]
The speech signal is produced from the words in the translated sentence.
“A MT system is correct and complete if it can analyze all of the grammatical structures
encountered in the source language, and it can generate all of the grammatical structures
necessary in the target language translation.”
8/25/2011 16
17. SATC System: Requirements and Challenging Tasks
Creation of large-scale, rich multilingual speech databases is a crucial
task for research and development in language and speech technology
Indian languages
speakers (10 Males and 10 Females)
age groups (<20, 15-40, >40)
audio format: 16-bit stereo, and sampling rate of 44.1 kHz
annotation and assessment of speech databases
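The audio format above fixes the storage cost of the database. As a quick worked calculation, assuming uncompressed PCM with no file header overhead (the helper name is hypothetical):

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Raw data rate of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

rate = pcm_bytes_per_second(44100, 16, 2)   # 16-bit stereo at 44.1 kHz
one_minute_mb = rate * 60 / 1e6             # megabytes per minute of audio
```

This gives 176,400 bytes per second, or roughly 10.6 MB per minute per recording, which matters when planning a corpus with many speakers and languages.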
Development of multilingual text to speech interface
Development of spoken word matching module
Development of speech signal processing (SSP) tools
18. Major Problems in Speech Processing
Acoustic variability: the same phonemes pronounced in
different contexts will have different acoustic realization
(coarticulation effect)
The signal is different when speech is uttered in various
environments:
noise
reverberation
different types of microphones.
Speaking variability: when the same speaker speaks normally,
shouts, whispers, uses a creaky voice, or has a cold
Speaker variability: different speakers have different
timbres and different speaking habits
19. Major Problems in Speech Processing
Linguistic variability: the same sentence can be pronounced
in many different ways, using many different words,
synonyms, and many different syntactic structures and
prosodic schemes
Phonetic variability: due to the different possible
pronunciations of the same words by speakers having
different regional accents
Lombard effect: noise modifies the utterance of the words (as
people tend to speak louder)
20. Major Problems in Speech Processing
Continuous speech:
words are connected together (not separated by pauses or
silences).
It is difficult to find the start and end points of words
The production of each phoneme is affected by the
production of surrounding phonemes
The start and end of words are affected by the preceding
and following words
The rate of speech varies (fast speech tends to be harder to recognize)