Speech processing


Published in: Education, Technology

  1. Research Issues in Speech Processing. Dr. M. Sabarimalai Manikandan, msm.sabari@gmail.com
  2. Speech Production: the source-filter model
     • The speech signal conveys the information contained in the spoken word.
     • Speech is a highly non-stationary signal, so it is analyzed in short segments (20 to 30 ms) over which it can be treated as approximately stationary.
     • The acoustical energy is in the frequency range of 100-6000 Hz.
     • The vocal tract transfer function can be modeled by an all-pole filter.
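As an illustration of the all-pole idea, the filter can be written as the recursion s[n] = G·e[n] + Σ a_k·s[n-k], where e[n] is the excitation (source) and the coefficients a_k shape the spectrum (filter). The following is a minimal sketch under stated assumptions: the function name, the impulse-train excitation, and the two coefficients are illustrative choices, not values from the slides.

```python
def all_pole_filter(excitation, coeffs, gain=1.0):
    """Compute s[n] = gain*e[n] + sum_k coeffs[k-1]*s[n-k] for each sample."""
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:          # samples before n=0 are taken as zero
                acc += a * out[n - k]
        out.append(acc)
    return out

# Impulse train as a crude stand-in for voiced excitation (period is illustrative).
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(240)]

# Hypothetical second-order coefficients giving one stable resonance
# (pole radius 0.9); these are assumptions, not measured vocal tract values.
speech_like = all_pole_filter(excitation, [1.6, -0.81])
```

In practice the coefficients would be estimated from a speech frame (for example by linear prediction) rather than chosen by hand.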
  3. Speech Processing Tasks
     • Speech recognition (recognizing lexical content)
     • Speech synthesis (text-to-speech)
     • Speaker recognition (recognizing who is speaking)
     • Speech understanding and vocal dialog
     • Speech coding (data rate reduction)
     • Speech enhancement (noise reduction)
     • Speech transmission (noise-free communication)
     • Voice conversion
  4. Speech Processing
     Speech measurements:
     • Short-time energy (STE)
     • Zero crossing rate (ZCR)
     • Autocorrelation (AC)
     • Pitch period or frequency
     • Formants
     Speech signal components:
     • Speech vs. silence or non-speech
     • Voiced speech vs. unvoiced speech
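The first four measurements listed on this slide can be sketched on a single frame as follows. This is a simplified illustration: the 8 kHz sampling rate, the synthetic 200 Hz tone standing in for a voiced frame, and the 50-400 Hz pitch search range are all assumptions made for the example.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of the frame at the given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) of the autocorrelation peak within [min_lag, max_lag]."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(frame, lag))

fs = 8000  # assumed sampling rate in Hz
# 30 ms synthetic frame: a 200 Hz sine as a stand-in for voiced speech.
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(240)]

ste = short_time_energy(frame)
zcr = zero_crossing_rate(frame)
period = pitch_period(frame, min_lag=20, max_lag=160)  # 400 Hz down to 50 Hz
pitch_hz = fs / period
```

Voiced frames typically show relatively high energy and a low zero crossing rate, while unvoiced frames show the opposite, which is why these two measurements are often combined for the speech/silence and voiced/unvoiced decisions.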
  5. Speech Processing
     Speech representations or models:
     • Temporal features
       • Low energy rate
       • Zero crossing rate (ZCR)
       • 4 Hz modulation energy
       • Pitch contour
     • Spectral features
       • Spectral centroid (sharpness)
       • Spectral flux (rate of change)
       • Spectral roll-off (spectral shape)
       • Spectral flatness (deviation of the spectral form)
     • Linear Predictive Coefficients (LPC)
     • Cepstral coefficients
     • Mel Frequency Cepstral Coefficients (MFCC): human auditory system
     • Harmonic features: sinusoidal harmonic modelling
     • Perceptual features: model of the human hearing process
     • First order derivative (DELTA)
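The four spectral features and the first order derivative can be sketched as below. Definitions of these features vary between authors (normalization, magnitude vs. power, roll-off fraction), so this shows one common form of each rather than a reference implementation. The naive DFT, the 0.85 roll-off fraction, and the 1000 Hz test tone are assumptions made for the example.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2); fine for a short demo frame)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectral_centroid(mags, fs, n):
    """Magnitude-weighted mean frequency in Hz."""
    return sum(k * fs / n * m for k, m in enumerate(mags)) / sum(mags)

def spectral_rolloff(mags, fs, n, fraction=0.85):
    """Frequency below which `fraction` of the total magnitude lies."""
    target = fraction * sum(mags)
    running = 0.0
    for k, m in enumerate(mags):
        running += m
        if running >= target:
            return k * fs / n
    return (len(mags) - 1) * fs / n

def spectral_flatness(mags, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    power = [m * m + eps for m in mags]
    geo = math.exp(sum(math.log(p) for p in power) / len(power))
    return geo / (sum(power) / len(power))

def spectral_flux(prev_mags, mags):
    """Sum of squared bin-wise differences between consecutive spectra."""
    return sum((b - a) ** 2 for a, b in zip(prev_mags, mags))

def delta(values):
    """First-order difference of a feature track across consecutive frames."""
    return [b - a for a, b in zip(values, values[1:])]

fs, n = 8000, 64  # assumed sampling rate and frame length
frame = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]
mags = magnitude_spectrum(frame)
centroid_hz = spectral_centroid(mags, fs, n)
rolloff_hz = spectral_rolloff(mags, fs, n)
flatness = spectral_flatness(mags)
```

For a pure tone the centroid and roll-off both sit at the tone frequency and the flatness is close to zero; a noise-like frame would push the flatness toward one.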
  6. Elements of the speech signal
     • Phonemes: the smallest units of speech sounds
       • Vowels and consonants
       • ~12 to 21 different vowel sounds are used in the English language
       • Consonants involve rapid and sometimes subtle changes in sound. According to the manner of articulation they are: plosive (p, b, t, etc.), fricative (f, s, sh, etc.), nasal (m, n, ng), liquid (r, l), and semivowel (w, y)
       • Consonants are more independent of language than vowels are
     • Syllable: one or more phonemes
     • Word: one or more syllables
  7. Automatic Speech Recognition
     There are two uses for speech recognition systems:
     • Dictation: translation of the spoken word into written text
     • Computer control: control of the computer and software applications by speaking commands
     Types of system:
     • Speaker-dependent system: operates for a single speaker
     • Speaker-independent system: operates for any speaker of a particular type
     • Speaker-adaptive system: adapts its operation to the characteristics of new speakers
     The size of the vocabulary affects the complexity, processing requirements and accuracy of the system.
  8. Speech Recognition: Applications
     • Automatic translation
     • Vehicle navigation systems
     • Human-computer interaction
     • Content-based spoken audio search
     • Home automation
     • Pronunciation evaluation
     • Robotics
     • Video games
     • Transcription of speech into mobile text messages
     • People with disabilities
  9. Speech Recognition System
     • Sampling of speech
     • Acoustic signal processing:
       • Linear Prediction Cepstral Coefficients (LPCC)
       • Mel Frequency Cepstral Coefficients (MFCC)
       • Perceptual Linear Prediction Cepstral Coefficients (PLPCC)
     • Recognition of phonemes, groups of phonemes and words:
       • Dynamic Time Warping (DTW)
       • Hidden Markov models (HMMs)
       • Gaussian mixture models (GMMs)
       • Neural Networks (NNs)
       • Expert systems and combinations of techniques
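Of the recognition techniques listed, dynamic time warping is the simplest to show in a few lines: it aligns two feature sequences of different lengths and returns the cost of the best alignment. The sketch below uses one scalar feature per frame and absolute difference as the local distance, both simplifying assumptions; a real recognizer would compare feature vectors such as MFCCs. The word templates are made-up numbers for illustration.

```python
def dtw_distance(a, b):
    """Cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])            # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]

# Hypothetical per-frame feature tracks for two stored word templates.
templates = {"yes": [1.0, 3.0, 4.0, 2.0], "no": [4.0, 1.0, 1.0, 0.0]}

# The same "yes" shape spoken more slowly (one frame repeated).
utterance = [1.0, 3.0, 3.0, 4.0, 2.0]
recognized = min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
```

Because the repeated frame can be absorbed by the warping path, the slower utterance still matches its template at zero cost.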
  10. Automatic Speaker Recognition
     • Speaker recognition: the process of automatically recognizing who is speaking by using the speaker-specific information included in speech sounds
     • Speaker identity: physiological and behavioral characteristics of the speech production model of an individual speaker
       • the spectral envelope (vocal tract characteristics)
       • the supra-segmental features (voice source characteristics) of speech
     • Applications:
       • banking over a telephone network
       • telephone shopping and database access services
       • voice dialing and mail
       • information and reservation services
       • security control for confidential information
       • forensics and surveillance applications
  11. Speaker Recognition
     Speaker identification: the process of determining which registered speaker provides the input speech sounds.
     [Block diagram: Input speech → Feature extraction → Similarity against the reference template or model of each registered speaker (#1, #2, ..., #N) → Maximum selection → Identification result (Speaker ID)]
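The identification diagram amounts to scoring the input features against every registered speaker's reference and picking the maximum. Here is a minimal sketch, assuming each speaker is represented by a single reference vector and that cosine similarity is the score. Both are illustrative choices, and the speaker names and numbers are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def identify_speaker(features, reference_models):
    """Return the registered speaker whose reference is most similar (maximum selection)."""
    return max(reference_models,
               key=lambda spk: cosine_similarity(features, reference_models[spk]))

# Hypothetical reference vectors for three registered speakers.
reference_models = {
    "speaker_1": [1.0, 0.0, 0.2],
    "speaker_2": [0.1, 1.0, 0.3],
    "speaker_3": [0.2, 0.2, 1.0],
}
speaker_id = identify_speaker([0.0, 0.9, 0.4], reference_models)
```

This is the closed-set case: the function always returns one of the registered speakers, even if the true speaker is not among them.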
  12. Speaker Recognition
     Speaker verification: the process of accepting or rejecting the identity claim of a speaker.
     [Block diagram: Input speech → Feature extraction → Similarity against the reference template or model of the claimed speaker (#M) → Decision against a threshold → Verification result (Accept/Reject)]
     • Open-set and closed-set recognition
     • Text-dependent and text-independent recognition
       • Vector quantization
       • Gaussian mixture models (GMM)
       • Dynamic time warping (DTW)
       • Hidden Markov model (HMM)
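Verification differs from identification in that only the claimed speaker's reference is consulted and the score is compared with a threshold. A minimal sketch, assuming a distance-based similarity score and hand-picked threshold values, neither of which comes from the slides:

```python
import math

def similarity(features, reference):
    """Map Euclidean distance to a score in (0, 1]; 1 means identical vectors."""
    return 1.0 / (1.0 + math.dist(features, reference))

def verify_speaker(features, claimed_reference, threshold):
    """Accept the identity claim only if the score reaches the threshold."""
    score = similarity(features, claimed_reference)
    return "Accept" if score >= threshold else "Reject"

# Hypothetical reference vector for the claimed speaker (#M in the diagram).
claimed_reference = [0.1, 1.0, 0.3]

close_result = verify_speaker([0.1, 0.9, 0.3], claimed_reference, threshold=0.8)
far_result = verify_speaker([1.0, 0.0, 0.2], claimed_reference, threshold=0.8)
```

Raising the threshold reduces false acceptances at the price of more false rejections, so in practice it is tuned on held-out data for the application at hand.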
  13. Text-to-Speech (TTS) System
     Synthesis of speech for effective human-machine communications:
     • reading email messages
     • call center help desks and customer care
     • announcement machines
     [Block diagram: Raw or tagged text → Text Analysis (document structure detection, text normalization, linguistic analysis) → Phonetic Analysis (homograph disambiguation, grapheme-to-phoneme conversion) → Prosodic Analysis (pitch, duration) → Speech Synthesis (voice rendering) → Synthetic speech]
     Synthetic speech should be intelligible and natural.
  14. Speech Synthesis
     Text-to-speech (TTS) synthesis systems approach
     TTS system performance measures:
     • Synthetic speech intelligibility
     • Synthetic speech naturalness
     Speech intelligibility tests:
     • Segmental level analysis
       • the Rhyme Test
       • the Modified Rhyme Test
       • the Diagnostic Rhyme Test
     • Supra-segmental analysis
       • the Harvard Psychoacoustic Sentences (HPS)
       • the Haskins syntactic sentences
  15. Speech Coding (Compression)
     Speech coding for efficient transmission and storage of speech:
     • narrowband and broadband wired telephony
     • cellular communications
     • Voice over IP (VoIP) to utilize the Internet
     • telephone answering machines
     • IVR systems
     • prerecorded messages
  16. Speech-Assisted Translation Corrector System
     Objective: develop a speech-assisted translation corrector (SATC) system which provides a grammatically correct sentence for a translated sentence from the machine translation.
     [Block diagram: input sentence (text, e.g. "He came here") → Multilingual machine translation system → translated sentence with grammatical errors → Speech-assisted translation corrector (with translator and speech storage) → grammatically correct sentence]
     The speech signal is produced from the words in the translated sentence.
     "An MT system is correct and complete if it can analyze all of the grammatical structures encountered in the source language, and it can generate all of the grammatical structures necessary in the target language translation."
     (Slide dated 8/25/2011)
  17. SATC System: Requirements and Challenging Tasks
     • Creation of large-scale, rich multilingual speech databases is a crucial task for research and development in language and speech technology:
       • Indian languages
       • speakers (10 males and 10 females)
       • age groups (<20, 15-40, >40)
       • audio format: 16-bit stereo, sampling rate of 44.1 kHz
       • annotation and assessment of speech databases
     • Development of a multilingual text-to-speech interface
     • Development of a spoken word matching module
     • Development of speech signal processing (SSP) tools
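The audio format on this slide fixes the raw storage cost of the database. A small worked calculation, assuming uncompressed PCM (the slide gives the bit depth, channel count and sampling rate but does not name the encoding):

```python
def pcm_bytes(duration_s, sample_rate_hz=44100, bits_per_sample=16, channels=2):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return duration_s * sample_rate_hz * (bits_per_sample // 8) * channels

bytes_per_second = pcm_bytes(1)    # 44100 samples * 2 bytes * 2 channels
bytes_per_minute = pcm_bytes(60)
bits_per_second = bytes_per_second * 8
```

That is 176,400 bytes per second (1,411,200 bits per second), or roughly 10.6 million bytes per minute of recording per speaker, before any annotation files are added.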
  18. Major Problems in Speech Processing
     • Acoustic variability: the same phonemes pronounced in different contexts will have different acoustic realizations (coarticulation effect). The signal is also different when speech is uttered in various environments:
       • noise
       • reverberation
       • different types of microphones
     • Speaking variability: when the same speaker speaks normally, shouts, whispers, uses a creaky voice, or has a cold
     • Speaker variability: different speakers have different timbres and different speaking habits
  19. Major Problems in Speech Processing
     • Linguistic variability: the same sentence can be pronounced in many different ways, using many different words, synonyms, and many different syntactic structures and prosodic schemes
     • Phonetic variability: due to the different possible pronunciations of the same words by speakers having different regional accents
     • Lombard effect: noise modifies the utterance of the words (as people tend to speak louder)
  20. Major Problems in Speech Processing
     Continuous speech: words are connected together (not separated by pauses or silences).
     • It is difficult to find the start and end points of words
     • The production of each phoneme is affected by the production of surrounding phonemes
     • The start and end of words are affected by the preceding and following words
     • Recognition is also affected by the rate of speech (fast speech tends to be harder)
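To see why endpoints are hard to find in continuous speech, it helps to look at the simplest endpoint detector, which marks as speech every frame whose energy exceeds a threshold. The sketch below is an illustration under assumptions: the frame length, the threshold, and the synthetic signal (silence, a burst, silence) are all made up for the example.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(x * x for x in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def find_endpoints(signal, frame_len, threshold):
    """Return (start_sample, end_sample) spanning the frames above threshold, or None."""
    active = [i for i, e in enumerate(frame_energies(signal, frame_len))
              if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# Synthetic isolated "word": silence, a burst of activity, then silence.
signal = [0.0] * 100 + [1.0, -1.0] * 50 + [0.0] * 100
endpoints = find_endpoints(signal, frame_len=50, threshold=0.1)
```

This works for an isolated word surrounded by silence. In continuous speech there is often no low-energy gap between adjacent words, so the detector reports one long active region and the word boundaries have to be recovered by the recognizer itself.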