Speech synthesis



Published in: Education


  1. 1. Speech Synthesis
  2. 2. Speech synthesis Speech synthesis is the artificial production of human speech. The computer or instrument used for this purpose is called a speech synthesizer. A Text-To-Speech(TTS) synthesis is production of speech from normal language text. Input text phonetic synthesised levels speech simple text-to-speech synthesis 2 text and linguistic analysis Prosody and speech generation
  3. 3. Stephen Hawking Suffering from motor neuron disease ALS. Lost his speech ability. First computer based speech system was provided by Intel®. Main interface program is EZ KEYS written by Word plus Inc. Cursor is controlled by cheek moments and detected by IR sensor mounted on spectacles. Formed words are sent to speech synthesiser ,hardware made by speech+. 3
  4. 4. Stephen Hawking
     • The speech synthesizer's voice output can also be stored.
     • Current configuration:
       • Lenovo ThinkPad X220 tablet (2 copies)
       • Intel® Core™ i7-2620M CPU @ 2.7 GHz
       • Intel® 150 GB Solid-State Drive 520 Series
       • Windows 7
       • Speech synthesizers (3 copies), manufacturer: Speech+, CA
  5. 5. History of the speech synthesizer
     • The first device to be considered a speech synthesizer was the VODER, introduced by Homer Dudley at the New York World's Fair in 1939.
     • The first formant synthesizer, PAT (Parametric Artificial Talker), was introduced by Walter Lawrence in 1953.
  6. 6. Architecture of TTS systems
     [Figure: block diagram of a TTS system]
     Text-to-phoneme module:
       Text input (text in orthographic form)
       -> Normalization (uses the abbreviation lexicon)
       -> Grapheme-to-phoneme conversion (uses the exceptions lexicon and orthographic rules)
       -> Phoneme string
     Phoneme-to-speech module:
       -> Prosodic modelling (uses grammar rules and the prosodic model)
       -> Phoneme string + prosodic annotation
       -> Acoustic synthesis (various methods)
       -> Synthetic speech output
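
The module chain in this architecture can be sketched as a sequence of small functions. This is only an illustration under stated assumptions: the function names, the tiny lexicons, the ARPAbet-like phoneme symbols and the fixed 80 ms duration are hypothetical stand-ins, and the acoustic synthesis stage is left out.

```python
# Hypothetical toy data; a real system would use large lexicons and rule sets.
ABBREVIATIONS = {"dr": "doctor", "mr": "mister"}
EXCEPTIONS = {"doctor": ["D", "AA", "K", "T", "ER"], "who": ["HH", "UW"]}


def normalize(text):
    """Lower-case the text, strip punctuation and expand known abbreviations."""
    words = []
    for token in text.lower().split():
        token = token.strip(".,!?")
        if token:
            words.append(ABBREVIATIONS.get(token, token))
    return words


def grapheme_to_phoneme(word):
    """Try the exceptions lexicon first, then a naive one-letter-one-symbol rule."""
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    return [ch.upper() for ch in word if ch.isalpha()]


def add_prosody(phonemes, is_question):
    """Attach a placeholder duration and a sentence-level pitch trend."""
    trend = "rise" if is_question else "fall"
    return [{"phoneme": p, "duration_ms": 80, "pitch": trend} for p in phonemes]


def text_to_annotated_phonemes(text):
    """Text-to-phoneme module followed by prosodic modelling."""
    phonemes = []
    for word in normalize(text):
        phonemes.extend(grapheme_to_phoneme(word))
    return add_prosody(phonemes, is_question=text.strip().endswith("?"))
```

Calling `text_to_annotated_phonemes("Dr. Who?")` yields one annotated entry per phoneme, which is the "phoneme string + prosodic annotation" that the acoustic synthesis stage would consume.
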
  7. 7. Challenges in speech synthesis
     • Text-to-phoneme conversion: the conversion of input text into a linguistic representation, also called grapheme-to-phoneme conversion.
     • Text processing: digits, numerals, fractions, dates and abbreviations are expanded into full words.
     • Pronunciation: the next task is to find the correct pronunciation. Homographs must be pronounced correctly (for example, "lead" the metal versus "lead" the verb).
  8. 8. Challenges in speech synthesis • Prosody: finding the correct intonation, stress, and duration for written text.
  9. 9. Text normalization
     • Text processing: digits, numerals, fractions, dates and abbreviations are expanded into full words.
     • Examples:
       • 1750 is expanded as "seventeen fifty" (if a year) or "one thousand seven hundred and fifty" (if a measure).
       • 5/13 is expanded as "five thirteenths" (if a fraction) or "May thirteenth" (if a date).
     • Numbers are especially difficult, e.g. 233 4488.
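
The 1750 example can be sketched as two readings of the same digit string. This is a minimal illustration: the function names `as_measure` and `as_year` are hypothetical, only 0 to 9999 is handled, and deciding which reading applies (the hard part in practice) is left to the caller.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def as_measure(n):
    """Spell out 0..9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Spell out a four-digit year as two pairs of digits.

    Years such as 1900 or 2005 follow other conventions; this sketch
    simply falls back to the measure reading for them.
    """
    high, low = divmod(n, 100)
    if not 1000 <= n <= 9999 or high % 10 == 0 or low < 10:
        return as_measure(n)
    return f"{two_digits(high)} {two_digits(low)}"
```

Here `as_year(1750)` gives "seventeen fifty" and `as_measure(1750)` gives "one thousand seven hundred and fifty", matching the two expansions on the slide.
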
  10. 10. Text normalization
      • Any text that has a special pronunciation is stored in a lexicon:
        • Abbreviations (Mr, Dr, St)
        • Acronyms (UN is spelled out letter by letter, UNESCO is read as a word)
        • Special symbols (&, %)
        • Particular conventions (£5, $5 million, 12°C)
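
A lexicon of this kind can be sketched as a plain dictionary look-up over tokens. The entries and the `expand` helper below are hypothetical; "St" is deliberately left out because it is ambiguous ("Saint" or "Street") and would need context, and the currency and temperature conventions are not covered.

```python
import re

# Hypothetical mini-lexicon of items with a special pronunciation.
LEXICON = {
    "Mr": "mister",
    "Dr": "doctor",
    "&": "and",
    "%": "percent",
    "UN": "U N",  # spelled out letter by letter
}

# A token is a run of letters, a run of digits, or any other single symbol.
TOKEN = re.compile(r"[A-Za-z]+|\d+|\S")


def expand(text):
    """Replace every token found in the lexicon; drop periods; keep the rest."""
    tokens = TOKEN.findall(text)
    return " ".join(LEXICON.get(tok, tok) for tok in tokens if tok != ".")
```

For example, `expand("Dr. Smith & Mr. Jones earn 5%")` returns "doctor Smith and mister Jones earn 5 percent"; the digit would then go through number expansion.
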
  11. 11. Grapheme-to-phoneme conversion
      • The conversion of input text into a linguistic (phonemic) representation.
      • English spelling is complex but largely regular; other languages are more regular or less so.
      • Gross exceptions must be stored in the lexicon.
      • Lexicon features:
        • Look-up should be quick.
        • Rules are needed anyway, for unknown words.
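
The combination of a quick lexicon look-up with fallback rules for unknown words can be sketched as follows. The exceptions, the digraph and single-letter tables and the ARPAbet-like symbols are hypothetical and far too small for real English; they only show the control flow.

```python
# Hypothetical exceptions lexicon: words whose spelling breaks the rules.
EXCEPTIONS = {"one": ["W", "AH", "N"], "two": ["T", "UW"]}

# Hypothetical letter-to-sound rules; digraphs are tried before single letters.
DIGRAPHS = {"sh": ["SH"], "ch": ["CH"], "th": ["TH"], "ee": ["IY"]}
SINGLE = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"], "f": ["F"],
    "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"], "k": ["K"], "l": ["L"],
    "m": ["M"], "n": ["N"], "o": ["AA"], "p": ["P"], "q": ["K"], "r": ["R"],
    "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"], "w": ["W"],
    "x": ["K", "S"], "y": ["Y"], "z": ["Z"],
}


def to_phonemes(word):
    """Return a phoneme list: dictionary look-up first, rules as fallback."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.extend(DIGRAPHS[pair])
            i += 2
        elif word[i] in SINGLE:
            phonemes.extend(SINGLE[word[i]])
            i += 1
        else:
            i += 1  # skip characters the rules do not cover
    return phonemes
```

A dictionary gives constant-time look-up for the exceptions, while the rule tables give some answer for any unseen word, even if that answer is crude.
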
  12. 12. Grapheme-to-phoneme conversion
      • Much easier for some languages (Spanish, Italian, Welsh, Czech, Korean).
      • Much harder for others (English, French).
      • Especially hard if the writing system is only partially alphabetic (Arabic, Urdu),
      • or not alphabetic at all (Chinese, Japanese).
  13. 13. Prosody modelling The voice parameters affected by emotions are usually categorized in three main types: Voice quality contains largely constant voice characteristics over the spoken utterance, such as loudness and breathiness. Pitch contour and its dynamic changes carry important emotional information. Time characteristics contain the general rhythm, speech rate, the lengthening and shortening of the stressed syllables, the length of content words, and the duration an placing of pauses. 13
  14. 14. Prosody modelling
      • The secondary emotional states are:
        • Anger: the voice is very breathy and has tense articulation with abrupt changes.
        • Happiness or joy: the voice is breathy and light, without tension.
        • Fear or anxiety: articulation is precise, the voice is irregular, and energy at lower frequencies is reduced.
        • Sadness or sorrow: articulation precision and speech rate are decreased.
        • Disgust or contempt: the average pitch level and the speech rate are lower than in normal speech, and the number of pauses is high.
        • Whispering and shouting: whispering is produced by speaking with high breathiness and without fundamental frequency; shouted speech has an increased pitch range and intensity, with greater variability.
  15. 15. Acoustic synthesis
      • Methods, techniques and algorithms:
        • Articulatory synthesis
        • Formant synthesis
        • Concatenative synthesis
          • PSOLA method
          • Microphonemic method
          • Linear prediction based methods
          • Sinusoidal models
  16. 16. Articulatory synthesis
      • Refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there.
      • Wolfgang von Kempelen and others used bellows, reeds and tubes to construct mechanical speaking machines.
      • Modern versions simulate electronically the effect of articulator positions, vocal tract shape, etc.
  17. 17. Formant synthesis
      • A formant is an acoustic resonance of the human vocal tract.
      • Probably the most widely used synthesis method of recent decades.
      • The synthesized speech output is created using additive synthesis and an acoustic model.
      • SoftVoice synthesizers simulate the human speech production mechanism using digital oscillators, noise sources, and filters (formant resonators), much like an electronic music synthesizer.
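
The idea of a source passed through formant resonators can be sketched with a few lines of standard-library Python. This is a simplified source-filter illustration, not the method of any product named here: an impulse train stands in for the voiced source, each formant is a two-pole digital resonator, and the formant values used in the usage note are rough illustrative figures for an "ah"-like vowel.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator modelling a single formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.3, sample_rate=16000):
    """Filter an impulse train at pitch f0_hz through cascaded resonators.

    `formants` is a list of (centre frequency, bandwidth) pairs in Hz.
    The result is a list of samples normalized to a peak of 1.0.
    """
    n = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]
```

For example, `synthesize_vowel(120, [(700, 130), (1220, 70), (2600, 160)])` returns 4800 samples; changing the formant frequencies changes the vowel quality, while changing `f0_hz` changes the pitch.
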
  18. 18. Formant synthesis
      Demo: Microsoft Windows
      • In the Control Panel, select the "Speech" icon.
      • Type in your text and preview the voice.
      • You may have a choice of voices.
  19. 19. Concatenative synthesis
      • Concatenates segments of pre-recorded natural human speech.
      • Requires a database or lexicon of previously recorded human speech covering all the possible segments to be synthesized.
      • A segment might be a phoneme, syllable, word, phrase, or any combination.
      • Diphone segments can be digitally manipulated for length, pitch and loudness.
      • Segment boundaries need to be smoothed to avoid distortion.
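
Joining recorded segments with smoothed boundaries can be sketched as a linear crossfade. The helper names and the notion of an `inventory` dictionary mapping unit names to sample lists are hypothetical; real systems store far more than raw samples per unit.

```python
def crossfade_concat(segments, overlap):
    """Join a non-empty list of sample lists, crossfading at each joint.

    At every joint the last `overlap` samples of the output so far are
    blended linearly with the first `overlap` samples of the next segment.
    """
    out = list(segments[0])
    for seg in segments[1:]:
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # fade-in weight of the incoming segment
            out[start + i] = (1.0 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])
    return out


def synthesize(unit_names, inventory, overlap=4):
    """Look up each named unit in the inventory and concatenate the results."""
    return crossfade_concat([inventory[name] for name in unit_names], overlap)
```

With `overlap=0` the segments are simply butted together, which is where audible discontinuities would appear; a larger overlap trades sharpness at the joint for smoothness.
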
  20. 20. Concatenative synthesis methods
      • PSOLA (Pitch Synchronous Overlap-Add): this algorithm concatenates segments smoothly and provides good control of pitch and duration. It is used in commercial synthesis systems. Time-domain PSOLA is the most commonly used variant because of its computational efficiency.
      • Microphonemic method: concatenation is done by linear amplitude-based interpolation between the prototypes.
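
The overlap-add step of time-domain PSOLA can be sketched under a strong simplifying assumption: the input is fully voiced with a constant, known pitch period, so the analysis pitch marks are evenly spaced. Real systems have to detect pitch marks; the function name and parameters here are hypothetical.

```python
import math


def hann(length):
    """Symmetric Hann window with `length` points."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def td_psola(signal, period, new_period):
    """Change the pitch period of `signal` while keeping its duration.

    `period` is the assumed constant input pitch period in samples and
    `new_period` the desired one. The signal must be longer than two periods.
    """
    window = hann(2 * period + 1)
    # Analysis pitch marks: one per input period, away from the edges.
    marks = list(range(period, len(signal) - period, period))
    out = [0.0] * len(signal)
    t = period  # current synthesis pitch mark
    while t < len(signal) - period:
        # Reuse the grain whose analysis mark is closest in time.
        m = min(marks, key=lambda a: abs(a - t))
        for k in range(-period, period + 1):
            out[t + k] += signal[m + k] * window[k + period]
        t += new_period
    return out
```

Placing the two-period grains closer together raises the pitch and placing them further apart lowers it; because each grain is taken from the nearest point in time, the overall duration stays the same.
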
  21. 21. Concatenative synthesis methods
      • Linear prediction based methods: originally designed for speech coding systems, but also used for speech synthesis. The covariance and autocorrelation methods are used to compute the predictor coefficients.
      • Sinusoidal models: based on the assumption that the voice signal can be represented as a sum of sine waves with time-varying amplitudes and frequencies. Sinusoidal models have been used successfully in singing voice synthesis with a MIDI interface.
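
The sinusoidal model can be sketched as a sum of oscillators whose amplitude and frequency are functions of time. The function names and the representation of a partial as a pair of callables are hypothetical choices for this illustration; `midi_to_hz` uses the standard equal-temperament mapping with note 69 at 440 Hz.

```python
import math


def midi_to_hz(note):
    """Convert a MIDI note number to frequency (note 69 is A4 at 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def synthesize_partials(partials, n_samples, sample_rate):
    """Sum sine waves with time-varying amplitude and frequency.

    `partials` is a list of (amp_fn, freq_fn) pairs; each function maps
    time in seconds to an amplitude or to a frequency in Hz.
    """
    out = [0.0] * n_samples
    for amp_fn, freq_fn in partials:
        phase = 0.0
        for n in range(n_samples):
            t = n / sample_rate
            out[n] += amp_fn(t) * math.sin(phase)
            # Accumulate phase so that frequency changes stay continuous.
            phase += 2.0 * math.pi * freq_fn(t) / sample_rate
    return out
```

A sung note with a slow vibrato could, for instance, be approximated by a frequency function such as `lambda t: midi_to_hz(69) * (1 + 0.01 * math.sin(2 * math.pi * 5 * t))` together with a few harmonics at decreasing amplitudes.
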
  22. 22. Speech synthesis demo
  23. 23. Speech synthesis demo
  24. 24. APPLICATIONS Application for the blind Used for reading and communication aid for blind Current systems are mostly software based ,so with scanner and OCR(optical character recognition) systems Application for deafened and vocally handicapped Provides opportunity to communicate with people who do not understand sign language. HAMLET helps users to express their feelings. HAMLET system is used with high quality TTS such as DECTALK. 24
  25. 25. APPLICATIONS Educational applications Programmed for special tasks like spelling and pronunciation teaching for different languages.  speech synthesizer is connected to word processor which is helpful for proof reading. Applications for telecommunication and multimedia Synthesized speech is used in all kind of telephone enquiry systems. VoiceXML: Internet surfing using voice. 25
  26. 26. PRODUCTS • INFOVOX: the INFOVOX speech synthesizer is perhaps one of the best known multilingual TTS products. The latest full commercial version available is INFOVOX IVOX.
  27. 27. PRODUCTS • DECtalk: available for American English, Spanish and German, in nine different voice personalities: four female, four male and one child.
  28. 28. PRODUCTS • Bell Labs Text-to-Speech: available in English, French, Spanish, Italian, German, Russian, Romanian, Chinese and Japanese.
  29. 29. PRODUCTS
      • SoftVoice: best known for the SAM (Software Automatic Mouth) synthesizer for Apple (MacinTalk), Amiga and Atari computers. The fifth-generation SoftVoice is also available for Windows in 20 different languages.
      • CNET PSOLA: one of the most promising methods for concatenative synthesis, developed by France Telecom's CNET (Centre National d'Études des Télécommunications).
  30. 30. PRODUCTS • Apple PlainTalk: Apple developed three different speech synthesis systems for Macintosh personal computers.
  31. 31. PRODUCTS • Microsoft Whistler: Whistler (Whisper Highly Intelligent Stochastic TaLkER) is a trainable speech synthesis system under development at Microsoft Research, Redmond, USA. The system is designed to produce synthetic speech that sounds natural and resembles the acoustic and prosodic characteristics of the original speaker.
  32. 32. THANK YOU