Emotional prosody


Voice-enabled mobile applications provide users access to information instantly, naturally, and almost effortlessly. Simple voice commands, however, have failed to gain traction, probably because it’s hard to remember the exact wording of a command phrase. Instead, more lenient and flexible conversational-style software agents have been more successful.
When it comes to communicating results back to the user, a text response often seems enough. Still, to provide a truly hands-free, eyes-free user experience, a text response needs to be synthesized and played through the phone’s speaker.
The quality of the speech synthesis is determined by many factors, including sound quality (sampling rate, dynamic range), prosody (rhythm, stress, and intonation of speech), and, maybe less obviously, emotional prosody, conveyed through changes in pitch, loudness, timbre, speech rate, and pauses.
This talk will share some ideas, concepts, and the technology needed to build a prototype implementation that synthesizes text augmented with emotional values.

Published in: Technology

Emotional prosody

  1. Emotional Prosody, Wolf Paulus. © 2014 Wolf Paulus
  2. 2001: A Space Odyssey (1968): HAL 9000
  3. Real world Speech Synthesis in 2014
      • Mac OS X - built-in Text-to-Speech
      • Cepstral / Voice Forge TTS Web Service
      • Loquendo - “Emotional Voices” with “expressive cues”
      Expressive TTS. Non-verbal expressions: crying, sighing, coughing
      Example: _Ah_01 _Ah_02 _Ah_03 _Laugh_01 _Laugh_02 _Laugh_03 🔊 🔊 🔊
  4. Content, Structure, Style
      Content (TEXT): CUPERTINO, Calif.--(BUSINESS WIRE)--Apple® today announced the all-new Mac Pro® will be available to order starting Thursday, December 19. Redesigned from the inside out, the all-new Mac Pro features the latest Intel Xeon processors, dual workstation-class GPUs, PCIe-based flash storage and ultra-fast ECC memory. Designed around an innovative unified thermal core, the all-new Mac Pro packs unprecedented performance into an aluminum enclosure that is just 9.9 inches tall and one-eighth the volume of the previous generation.
      Structure (HTML): <h2> All new Mac Pro available Tomorrow </h2> <p> Apple today announced the all-new Mac Pro will be
      Style (CSS)
  5. Content, Structure, Style/Emotion: Visual vs. Audible
      Layer           | Visual     | Audible    | Specification Status
      Content         | plain text | plain text |
      Structure       | HTML       | SSML       | Speech Synthesis Markup Language (SSML) 1.0, W3C Recommendation, 7 September 2004
      Style / Emotion | CSS        | EmotionML  | Emotion Markup Language (EmotionML) 1.0, W3C Proposed Recommendation, 16 April 2013
  6. SSML ∙ EmotionML
      <?xml version="1.0"?>
      <speak version="1.1" xml:lang="en-US"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:emo="http://www.w3.org/2009/10/emotionml">
        <s>
          <emo:emotion category-set="http://www.w3.org/TR/emotion-voc/xml#everyday-categories">
            <emo:category name="angry" value="0.4"/>
          </emo:emotion>
          What was that all about?
        </s>
      </speak>
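The markup on slide 6 can also be generated programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `build_ssml` and its parameters are illustrative assumptions and are not part of the talk. It only builds the document; whether a given TTS engine honors the embedded EmotionML depends on that engine.

```python
import xml.etree.ElementTree as ET

# Namespace URIs and category set as shown on the slide.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
EMO_NS = "http://www.w3.org/2009/10/emotionml"
XML_NS = "http://www.w3.org/XML/1998/namespace"
CATEGORY_SET = "http://www.w3.org/TR/emotion-voc/xml#everyday-categories"

# Serialize SSML as the default namespace and EmotionML with the "emo" prefix.
ET.register_namespace("", SSML_NS)
ET.register_namespace("emo", EMO_NS)


def build_ssml(text, emotion, intensity, lang="en-US"):
    """Hypothetical helper: one sentence annotated with one emotion category."""
    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.1", f"{{{XML_NS}}}lang": lang})
    sentence = ET.SubElement(speak, f"{{{SSML_NS}}}s")
    emo = ET.SubElement(sentence, f"{{{EMO_NS}}}emotion",
                        {"category-set": CATEGORY_SET})
    ET.SubElement(emo, f"{{{EMO_NS}}}category",
                  {"name": emotion, "value": str(intensity)})
    # As on the slide, the spoken text follows the emotion element inside <s>.
    emo.tail = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("What was that all about?", "angry", 0.4))
```

Building the tree with a library rather than string concatenation keeps the output well-formed and escapes characters such as `&` in the spoken text.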
  7. EmotionML
      <?xml version="1.0"?>
      <emotionml version="1.0"
                 xmlns="http://www.w3.org/2009/10/emotionml"
                 category-set="http://www.w3.org/TR/emotion-voc/xml#everyday-categories">
        Hello and good afternoon.
        <emotion><category name="angry"/> What was that all about? </emotion>
        <emotion><category name="happy"/> Nice to see you again! </emotion>
        <emotion><category name="sad"/> Yeah I also had something else in mind than this. </emotion>
      </emotionml>
      German Research Center for Artificial Intelligence 🔊 🔊
  8. Universal Facial Expressions: Surprise, Sadness, Happiness, Disgust, Fear, Anger .. recognized and produced in all human cultures
  9. Effect of Emotion on the human Voice
      Pitch: average pitch (a frequency) of the speaking voice
      Range: variation in average pitch
      Volume: median volume of the waveform
      Rate: speaking rate
      Intonation (or melody), Timing (or rhythm)
  10. Effect of Emotion on the human Voice
      [Chart: sad, happy, and angry speech compared by pitch, range, volume, and rate]
      Pitch increases for emotions with high excitation (anger, fear, and happiness)
      Pitch decreases for emotions with little excitation (sadness or boredom) …
      Source: Murtaza Bulut, Sungbok Lee, and Shrikanth S. Narayanan. Analysis of emotional speech prosody in terms of part of speech tags. INTERSPEECH, pages 626-629. ISCA, 2007.
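The pitch rule from slide 10 can be turned into a small lookup. This is a hedged sketch, not the talk's implementation: only the direction of the pitch change (up for high excitation, down for low excitation) comes from the slide. The 10% magnitude and the names `pitch_shift` and `wrap_prosody` are assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Emotion groups as named on the slide.
HIGH_EXCITATION = {"anger", "fear", "happiness"}
LOW_EXCITATION = {"sadness", "boredom"}


def pitch_shift(emotion, magnitude_percent=10):
    """Return a relative SSML pitch value such as '+10%'.

    The direction follows the slide; the magnitude is an arbitrary
    placeholder, not a measured value.
    """
    if emotion in HIGH_EXCITATION:
        sign = 1
    elif emotion in LOW_EXCITATION:
        sign = -1
    else:
        sign = 0  # unknown emotion: leave the pitch unchanged
    return f"{sign * magnitude_percent:+d}%"


def wrap_prosody(text, emotion):
    """Wrap text in an SSML <prosody> element with the shifted pitch."""
    return f"<prosody pitch={quoteattr(pitch_shift(emotion))}>{escape(text)}</prosody>"


if __name__ == "__main__":
    print(wrap_prosody("Nice to see you again!", "happiness"))
    print(wrap_prosody("Yeah I also had something else in mind than this.", "sadness"))
```

A fuller version would also adjust range, volume, and rate, but the slide only states the direction for pitch, so the sketch stops there.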
  11. Uttering bad news in a happy voice: “There has been a fatal accident .. you will be 2.8 hours late to your Valentine’s dinner appointment” 🔊
  12. Emotional text recognition using Whissell's Dictionary of Affect in Language
      [Plot: the words love, hate, sister, cold, and you placed by valence (x-axis, negative to positive) and activation (y-axis, low to high); red is negative emotion, blue is positive, size indicates activation]
      …
      <word>
        <token>love</token>
        <emotion>
          <measure type="DAL" valence="3.0" activation="2.6364" imagery="1.4"/>
        </emotion>
      </word>
      …
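A dictionary-based lookup like the one on slide 12 can be sketched in a few lines. The single entry below is the `love` measure shown on the slide; a real run would load the full Dictionary of Affect in Language, which is not reproduced here. The function name `score_text` and the choice to average the scores of matched words are assumptions for illustration, not necessarily how the prototype combines them.

```python
import re
from statistics import mean

# Only the entry shown on the slide; the full dictionary would be loaded here.
DAL = {
    "love": {"valence": 3.0, "activation": 2.6364, "imagery": 1.4},
}

DIMENSIONS = ("valence", "activation", "imagery")


def score_text(text, dal=DAL):
    """Average the DAL measures of all words found in the dictionary.

    Returns None when no word of the text is in the dictionary.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [dal[token] for token in tokens if token in dal]
    if not hits:
        return None
    return {dim: mean(hit[dim] for hit in hits) for dim in DIMENSIONS}


if __name__ == "__main__":
    print(score_text("I love my sister"))
    print(score_text("Traffic is backed up"))
```

Words missing from the dictionary are skipped rather than counted as neutral, so a sentence with one strongly rated word is not diluted by unrated ones.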
  13. SSML, Prosody, and Intonation Editor
  14. SSML, Prosody, and Intonation Editor
  15. SSML, Prosody, and Intonation Editor 🔊
  16. SSML, Prosody, and Intonation Editor
  17. SSML, Prosody, and Intonation Editor
  18. SSML, Prosody, and Intonation Editor
      <?xml version="1.0" encoding="UTF-8"?>
      <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
        <voice name="Lindsey">
          <prosody rate="1.4" pitch="0%" volume="50"> There </prosody>
          <prosody rate="1.3" pitch="-1%" volume="49"> has </prosody>
          <prosody rate="1.3" pitch="-2%" volume="48"> been </prosody>
          <prosody rate="1.2" pitch="-4%" volume="47"> a </prosody>
          <prosody rate="1.2" pitch="-5%" volume="47"> <emphasis level="moderate">fatal</emphasis> </prosody>
          <prosody rate="1.2" pitch="-7%" volume="46"> accident </prosody>
          <prosody rate="1.1" pitch="-8%" volume="45"> and </prosody>
          <prosody rate="1.1" pitch="-10%" volume="45"> traffic </prosody>
          <prosody rate="1.1" pitch="-11%" volume="44"> is </prosody>
          <prosody rate="1.0" pitch="-12%" volume="43"> backed </prosody>
          <prosody rate="1.0" pitch="-14%" volume="42"> up </prosody>
          <prosody rate="1.0" pitch="-15%" volume="42"> for </prosody>
          <prosody rate="0.9" pitch="-17%" volume="41"> <emphasis level="moderate">17</emphasis> </prosody>
          <prosody rate="0.9" pitch="-18%" volume="40"> miles. </prosody>
        </voice>
      </speak>
      🔊
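The per-word markup on slide 18 lowers rate, pitch, and volume step by step across the sentence. A similar ramp can be approximated with linear interpolation, as in the sketch below. This is an assumption about how such values could be produced, not the editor's actual algorithm: the intermediate values it yields differ slightly from those on the slide, and the `<emphasis>` elements are omitted. The names `ramp` and `prosody_ramp` are hypothetical.

```python
from xml.sax.saxutils import escape


def ramp(start, end, steps):
    """Return `steps` evenly spaced values from start to end (inclusive)."""
    if steps == 1:
        return [start]
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


def prosody_ramp(sentence, rate=(1.4, 0.9), pitch=(0, -18), volume=(50, 40)):
    """Wrap each word in <prosody>, interpolating the attributes per word.

    The default start and end values are the first and last values on the slide.
    """
    words = sentence.split()
    n = len(words)
    lines = []
    for word, r, p, v in zip(words, ramp(*rate, n), ramp(*pitch, n), ramp(*volume, n)):
        lines.append(
            f'<prosody rate="{r:.1f}" pitch="{round(p)}%" volume="{round(v)}">'
            f"{escape(word)}</prosody>"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(prosody_ramp(
        "There has been a fatal accident and traffic is backed up for 17 miles."
    ))
```

The generated elements would still need to be placed inside `<speak>` and `<voice>` wrappers like those on the slide before being sent to a synthesizer.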
  19. “I have to tell you something, Uncle Harold just called. It’s about Martha, you know …” 🔊
  20. Summary
      • Speech synthesis examples (HAL 9000 .. Loquendo)
      • Two W3C standards to declaratively write text to be synthesized (SSML and EmotionML)
      • Effect of emotion on the human voice (pitch, rate, range, volume, and intonation)
      • Augment your data with emotional values, e.g. using EmotionML
      • Automatic recognition of emotions in text (Whissell's Dictionary of Affect in Language)
      • Varying prosody attributes and intonation creates emotional effects
  21. Thanks for listening @wolfpaulus