Voice Browsing And Multimodal Interaction In 2009

    Voice Browsing And Multimodal Interaction In 2009 - Presentation Transcript

    1. Voice Browser and Multimodal Interaction in 2009. Paolo Baggia, Director of International Standards. Google TechTalk, March 6th, 2009.
    2. Overview: A Bit of History; W3C Speech Interaction Framework Today; ASR/DTMF; TTS; Lexicons; Voice Dialog and Call Control; Voice Platforms and Next Evolutions; W3C Multimodal Interaction Today; MMI Architecture; EMMA and InkML; A Language for Emotions; Next Future.
    3. Company Profile. Privately held company (fully owned by Telecom Italia), founded in 2001 as a spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and expertise in voice processing. Global company, leader in Europe and South America for award-winning, high-quality voice technologies (synthesis, recognition, authentication and identification), available in 26 languages and 62 voices. Multilingual, proprietary technologies protected by over 100 patents worldwide. Financially robust: break-even reached in 2004, revenues and earnings growing year on year. Growth-plan investment approved for the evolution of products and services. Headquarters in Torino; local representative sales offices in Rome, Madrid, Paris, London and Munich; offices in New York. Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.
    4. International Awards: “2008 Frost & Sullivan European Telematics and Infotainment Emerging Company of the Year” Award; winner of the “Market leader-Best Speech Engine” Speech Industry Award, 2007 and 2008; Loquendo MRCP Server, winner of the 2008 IP Contact Center Technology Pioneer Award; “Best Innovation in Automotive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2007; “Best Innovation in Expressive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2006; “Best Innovation in Multi-Lingual Speech Synthesis” Prize, AVIOS-SpeechTEK West 2005.
    5. A Bit of History
    6. Standard Bodies. Two main standard bodies. W3C (World Wide Web Consortium): founded in 1994 by Tim Berners-Lee with a mission to lead the Web to its full potential; staff based at MIT (USA), ERCIM (France) and Keio University (Japan); 400 members all over the world; 50 Working, Interest and Coordination Groups. W3C is where the framework of today’s Web is developed (HTML, CSS, XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web Accessibility, Device Independence). IETF (Internet Engineering Task Force): founded in 1986, with growth from 1991 under the Internet Society; 1300 members; HTTP, SIP, RTP and many other protocols. The Media Resource Control Protocol (MRCP) is very relevant for speech platforms. Two industrial forums. VoiceXML Forum (www.voicexml.org): inventors of VoiceXML 1.0, then submitted to W3C for standardization; its current goal is to promote, disseminate and support VoiceXML and related standards. SALT Forum (www.saltforum.org): supported by Microsoft to define a lightweight markup for telephony and multimodal applications. Other relevant bodies: 3GPP, OMA, ETSI, NIST.
    7. The (r)evolution of VoiceXML. [Timeline figure; years marked: 1998, 1999, 2000, 2002, 2004, 2007, 2008, 2009. Milestones shown: W3C Voice Browser Workshop; W3C charters the Voice Browser WG; VoiceXML Forum birth (by AT&T, IBM, Lucent, Motorola); VoiceXML 1.0 released; SALT Forum birth (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks); W3C charters the Multimodal Interaction WG; VoiceXML 2.0 W3C Rec; SRGS 1.0 W3C Rec; SSML 1.0 W3C Rec; SISR 1.0 W3C Rec; PLS 1.0 W3C Rec; EMMA 1.0 W3C Rec.] [Photo: preparing to announce VoiceXML 1.0, Friday Feb. 25th, 2000, Lucent, Naperville, Illinois. Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent), Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola), Dave Ladd (Motorola).]
    8. Speech Interface Framework in 2000 (by Jim Larson). [Diagram; components shown: User; Telephone System; ASR with Speech Recognition Grammar Spec. (SRGS) and N-gram Grammar ML; DTMF Tone Recognizer; Language Understanding with Semantic Interpretation for Speech Recognition (SISR) and Natural Language Semantics ML / EMMA; Context Interpretation; Dialog Manager with VoiceXML 2.0 and VoiceXML 2.1; World Wide Web; Media Planning; Language Generation; TTS with Speech Synthesis Markup Language (SSML); Pre-recorded Audio Player; Pronunciation Lexicon Specification (PLS); Reusable Components; Call Control XML (CCXML).]
    9. Speech Interface Framework - Today (by Jim Larson). [Diagram; components shown: User; Telephone System; ASR with Speech Recognition Grammar Spec. (SRGS) and N-gram Grammar ML; DTMF Tone Recognizer; Language Understanding with Semantic Interpretation for Speech Recognition (SISR) and Natural Language Semantics ML / EMMA 1.0; Context Interpretation; Dialog Manager with VoiceXML 2.0 and VoiceXML 2.1; World Wide Web; Media Planning; Language Generation; TTS with Speech Synthesis Markup Language (SSML); Pre-recorded Audio Player; Pronunciation Lexicon Specification (PLS); Reusable Components; Call Control XML (CCXML).]
    10. Speech Interface Framework - End of 2009 (by Jim Larson). [Diagram; components shown: User; Telephone System; ASR with Speech Recognition Grammar Spec. (SRGS) and N-gram Grammar ML; DTMF Tone Recognizer; Language Understanding with Semantic Interpretation for Speech Recognition (SISR) and Natural Language Semantics ML / EMMA 1.0; Context Interpretation; Dialog Manager with VoiceXML 2.0 and VoiceXML 2.1; World Wide Web; Media Planning; Language Generation; TTS with Speech Synthesis Markup Language (SSML); Pre-recorded Audio Player; Pronunciation Lexicon Specification (PLS); Reusable Components; Call Control XML (CCXML).]
    11. W3C Process
    12. Architectural Changes. Traditional (proprietary) architecture: the user interacts through ASR / DTMF and TTS / Audio with a proprietary platform, which runs a speech application built with a proprietary SCE. VoiceXML architecture: the user interacts through ASR / DTMF and TTS / Audio with a VoiceXML browser on a VoiceXML platform, which fetches resources over HTTP from a Web application (.vxml; .grxml/.gram, .pls; .ssml, .wav/.mp3, .pls).
    13. The VoiceXML Impact. VoiceXML changed the landscape of IVRs and speech application creation: from proprietary to standard-based speech applications. Before: proprietary platforms (HW & SW); proprietary applications (by proprietary SCE); mainly DTMF and pre-recorded prompts; first attempts to add speech into IVR. After: standard VoiceXML platforms; standards for speech technologies; standard tools for VoiceXML applications; integration of DTMF and ASR; still a predominance of DTMF, but more and more speech applications.
    14. Overview: A Bit of History; W3C Speech Interaction Framework Today; ASR/DTMF; TTS; Lexicons; Voice Dialog and Call Control; Voice Platforms and Next Evolutions; W3C Multimodal Interaction Today; MMI Architecture; EMMA and InkML; A Language for Emotions; Next Future.
    15. Standards for ASR and DTMF: SRGS 1.0, SISR 1.0
    16. W3C Standards for Speech/DTMF Grammars. A speech grammar has two parts. SYNTAX (SRGS): defines constraints on admissible sentences for a specific recognition turn; forms: ABNF and XML; modes: voice and dtmf (http://www.w3.org/TR/speech-grammar/). SEMANTICS (SISR): describes how to produce results after an utterance is recognized; types: literal and script (http://www.w3.org/TR/semantic-interpretation/).
    17. SRGS/SISR Grammars for “Torino”
        SRGS XML, SISR literal:
            <?xml version="1.0" encoding="UTF-8"?>
            <grammar xml:lang="en-US" version="1.0"
                     xmlns="http://www.w3.org/2001/06/grammar"
                     tag-format="semantics/1.0-literals">
              <rule id="main" scope="public">
                <token>Torino</token>
                <tag>10100</tag>
              </rule>
            </grammar>
        SRGS ABNF, SISR literal:
            #ABNF 1.0 iso-8859-1;
            mode voice;
            tag-format <semantics/1.0-literals>;
            public $main = Torino {10100} ;
        SRGS XML, SISR script:
            <?xml version="1.0" encoding="UTF-8"?>
            <grammar xml:lang="en-US" version="1.0"
                     xmlns="http://www.w3.org/2001/06/grammar"
                     tag-format="semantics/1.0">
              <tag>var unused=7;</tag>
              <rule id="main" scope="public">
                <token>Torino</token>
                <tag>out="10100";</tag>
              </rule>
            </grammar>
        SRGS ABNF, SISR script:
            #ABNF 1.0 iso-8859-1;
            mode voice;
            tag-format <semantics/1.0>;
            {var unused=7;};
            public $main = Torino {out="10100";} ;
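To make the literal-semantics case concrete, here is a minimal Python sketch that loads the XML literal grammar from this slide and returns the tag of the rule whose token matches a recognized word. It is only an illustration: the function name `interpret` and the case-insensitive comparison are assumptions, and a real recognizer matches audio against the grammar rather than comparing strings.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "{http://www.w3.org/2001/06/grammar}"

# The SRGS XML grammar with SISR literal semantics shown on the slide.
GRAMMAR = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>10100</tag>
  </rule>
</grammar>"""


def interpret(grammar_xml, utterance):
    """Hypothetical helper: return the literal tag for a matching token, else None."""
    root = ET.fromstring(grammar_xml.encode("utf-8"))
    if root.get("tag-format") != "semantics/1.0-literals":
        raise ValueError("this sketch only handles literal semantics")
    for rule in root.findall(f"{SRGS_NS}rule"):
        if rule.get("scope") != "public":
            continue  # only public rules are entry points
        token = rule.find(f"{SRGS_NS}token")
        tag = rule.find(f"{SRGS_NS}tag")
        if token is None or tag is None:
            continue
        # Assumption: plain case-insensitive string match stands in for recognition.
        if token.text.strip().lower() == utterance.strip().lower():
            return tag.text.strip()
    return None


print(interpret(GRAMMAR, "torino"))
```

With script semantics the `<tag>` content would be ECMAScript to evaluate, which this sketch deliberately does not attempt.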
    18. SRGS/SISR Standards – Pros. Powerful syntax (CFG) and very powerful semantics (ECMAScript). DTMF and voice input are transparent to the application. Wide and consistent adoption among technology vendors. The two syntaxes, XML and ABNF, are great: developers can choose (XML validation vs. compact format), and transformations are possible: XML → ABNF (easy, a simple XSLT); ABNF → XML (requires an ABNF parser). Open source tools might be created to: validate grammar syntax; transform grammars; debug grammars on written input; run coverage tests (explode covered sentences, GenSem, SemTester, etc.).
    19. SRGS/SISR Standards – Small Issues. Semantics declaration via the tag-format attribute: if the value is “semantics/1.0”, SISR script semantics is mandated inside semantic tags; if the value is “semantics/1.0-literals”, SISR literal semantics is mandated inside semantic tags; if it is missing, the behavior is unclear, with a risk of interoperability troubles. SISR script semantics: clumsy default assignment (returns the last referenced rule only), so the developer must properly propagate results upward; be careful when redefining “out”, since assigning a scalar value might result in errors. SISR literal semantics: only useful for very simple word-list rules; no support for encapsulating rules; use SISR literal grammars as external references ONLY!
    20. SRGS/SISR – Encapsulated Grammars. [Diagram of grammars referencing one another, each labeled with its semantics type: Gr1.grxml (Script), Gr2.gram (Literal), Gr3.grxml (Script), Gr41.grxml (Literal), Gr42.gram (Script).]
    21. SRGS/SISR Standards – Rich XML Results. Section 7 of the SISR 1.0 specification (http://www.w3.org/TR/semantic-interpretation/#SI7) gives serialization rules from SISR ECMAScript results into XML. Edge cases: arrays; the special variables “_attributes” and “_value”; creation of namespaces and prefixes.
        ECMAScript result:
            { drink: {
                _nsdecl: { _prefix:"n1", _name:"http://www.example.com/n1" },
                _nsprefix:"n1",
                liquid: {
                  _nsdecl: { _prefix:"n2", _name:"http://www.example.com/n2" },
                  _attributes: { color: { _nsprefix:"n2", _value:"black" } },
                  _value:"coke"
                },
                size:"medium"
            } }
        Serialized XML:
            <n1:drink xmlns:n1="http://www.example.com/n1">
              <liquid n2:color="black"
                      xmlns:n2="http://www.example.com/n2">coke</liquid>
              <size>medium</size>
            </n1:drink>
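The mapping on this slide can be approximated in a few lines of Python. The sketch below is a simplified reading of the example only, not the normative algorithm of SISR section 7: it handles `_nsdecl`, `_nsprefix`, `_attributes` and `_value`, ignores arrays, and the function name `serialize` is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def serialize(name, value):
    """Serialize one property of a SISR-style result into XML text (simplified)."""
    if not isinstance(value, dict):
        # Scalar property: element with text content.
        return f"<{name}>{escape(str(value))}</{name}>"
    tag = f'{value["_nsprefix"]}:{name}' if "_nsprefix" in value else name
    attrs = []
    if "_nsdecl" in value:
        decl = value["_nsdecl"]
        attrs.append(f'xmlns:{decl["_prefix"]}={quoteattr(decl["_name"])}')
    for attr_name, attr in value.get("_attributes", {}).items():
        qname = f'{attr["_nsprefix"]}:{attr_name}' if "_nsprefix" in attr else attr_name
        attrs.append(f'{qname}={quoteattr(attr["_value"])}')
    # Properties that do not start with "_" become child elements.
    children = "".join(serialize(k, v) for k, v in value.items() if not k.startswith("_"))
    text = escape(str(value.get("_value", "")))
    attr_text = "".join(" " + a for a in attrs)
    return f"<{tag}{attr_text}>{text}{children}</{tag}>"


result = {
    "drink": {
        "_nsdecl": {"_prefix": "n1", "_name": "http://www.example.com/n1"},
        "_nsprefix": "n1",
        "liquid": {
            "_nsdecl": {"_prefix": "n2", "_name": "http://www.example.com/n2"},
            "_attributes": {"color": {"_nsprefix": "n2", "_value": "black"}},
            "_value": "coke",
        },
        "size": "medium",
    }
}

xml_text = serialize("drink", result["drink"])
print(xml_text)
```

The attribute order in the output may differ from the slide, which does not matter to an XML parser.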
    22. SRGS/SISR Standards – Next Steps. Adoption of the PLS 1.0 lexicon: the <token> element is a clear entry point into PLS lexicons, but <token> is missing a role attribute to allow homograph disambiguation. Next extensions via errata: XML 1.1 support and IRI; update of normative references. No major extensions are needed!
    23. Speech Synthesis: SSML 1.0/1.1
    24. TTS – Functional Architecture and Markup/Non-Markup support. Processing stages: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production. Structure Analysis: markup support <p>, <s>; non-markup support: infer the structure by automatic text analysis. Text Normalization: markup support <say-as> for date, time, phone number, numbers, and <sub> for acronyms and transliterations; non-markup support: automatically identify and convert constructs. Text-to-Phoneme Conversion: markup support <phoneme>, <lexicon>; non-markup support: look up in a pronunciation dictionary. Prosody Analysis: markup support <emphasis>, <break>, <prosody>; non-markup support: automatically generate prosody through analysis of document structure and sentence syntax. Waveform Production: markup support <voice>, <audio>. (http://www.w3.org/TR/speech-synthesis/)
    25. SSML 1.0 – Language description (I). Document structure: <speak> root element, version attribute, SSML namespace attribute; languages are declared with xml:lang. Processing and pronunciation: <p> and <s> (paragraph and sentence) give a structure to the text; <say-as> indicates the type of text construct contained within the element, e.g. date, numbers, etc.; <phoneme> provides a phonetic pronunciation for the contained text in IPA; <sub> provides substitutions, e.g. to expand acronyms into a sequence of words. (http://www.w3.org/TR/speech-synthesis/)
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
               xml:lang="en-US">
          <p>I don't speak Japanese.</p>
          <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
        </speak>
    26. SSML 1.0 – Language description (II). Style. The <voice> element selects the speaking voice; its voice selection attributes are name, xml:lang, gender, age, and variant. The <emphasis> element requests that the contained text be spoken with emphasis; its level attribute can be set to strong, moderate, reduced, or none. The <break> element controls the pausing between words: the time attribute takes time expressions in seconds or milliseconds, e.g. “5s”, “20ms”; the strength attribute takes the values none, x-weak, weak, medium (default value), strong, or x-strong. (http://www.w3.org/TR/speech-synthesis/)
        <?xml version="1.0" encoding="ISO-8859-1"?>
        <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
               xml:lang="en-US">
          The moon is rising on the beach, when John says, looking Mary in the eyes:
          <voice name="simon">I love you!</voice>
          but she suddenly replies:
          <voice name="susan">Please, be serious!</voice>
        </speak>
    27. SSML 1.0 – Language description (III). Prosody: the <prosody> element permits control of the pitch, speaking rate and volume of the speech output. Its attributes are: volume, the volume for the contained text; rate, the speaking rate in words-per-minute for the contained text; duration, a value in seconds or milliseconds for the desired time to take to read the element contents; pitch, the baseline pitch for the contained text; range, the pitch range (variability) for the contained text in Hertz; contour, which sets the actual pitch contour for the contained text. Other elements: <audio>, to play an audio file; <mark>, to place a marker into the text/tag sequence; <desc>, to provide a description of a non-speech audio source in <audio>. (http://www.w3.org/TR/speech-synthesis/)
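As a rough illustration of how the elements from the three SSML slides combine, the following Python sketch assembles a small SSML document with the standard library. The helper names `q` and `build_ssml`, the wording of the prompt, and the chosen attribute values ("500ms", "slow", "soft") are assumptions made for the example; whether a given engine honors them is platform-dependent.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(tag):
    """Qualify a tag name with the SSML namespace (hypothetical helper)."""
    return f"{{{SSML_NS}}}{tag}"


def build_ssml():
    # Serialize the SSML namespace as the default namespace.
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(q("speak"), {"version": "1.0", XML_LANG: "en-US"})
    speak.text = "John says: "
    voice = ET.SubElement(speak, q("voice"), {"name": "simon"})
    voice.text = "I love you!"
    # A half-second pause before the reply.
    pause = ET.SubElement(speak, q("break"), {"time": "500ms"})
    pause.tail = "She replies: "
    # Slower, softer delivery for the reply.
    prosody = ET.SubElement(speak, q("prosody"), {"rate": "slow", "volume": "soft"})
    prosody.text = "Please, be serious!"
    return ET.tostring(speak, encoding="unicode")


print(build_ssml())
```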
    28. Towards SSML 1.1 – Motivations. Internationalization needs, gathered in three workshops: Beijing (Nov ’05), Crete (May ’06), Hyderabad (Jan ’07). Results: no major needs for Eastern and Western European languages; many issues for Far East languages (Mandarin, Japanese, Korean); some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many Indian languages, such as marking input with or without vowels and marking the transliteration schema used for input. Extensions required by Voice Browser: more powerful error handling and selection of fall-back strategies; trimming attributes; a volume attribute that adopts a logarithmic scale (it was linear before); alignment with the PLS 1.0 specification for user lexicons. (http://www.w3.org/TR/speech-synthesis11/)
    29. SSML 1.1 – Language Changes. <w> element. Lexicon extensions: the <lookup> element selects which referenced lexicon applies to the contained text. Phonetic Alphabet Registry creation and adoption: "ipa" for the International Phonetic Alphabet; a registration policy for other phonetic alphabets, similar to LTRU for language tags. Candidates: Pinyin for Mandarin Chinese; JEITA for Japanese; X-SAMPA, an ASCII transliteration of IPA codes. (http://www.w3.org/TR/speech-synthesis/)
    30. Pronunciation Lexicon: PLS 1.0
    31. Pronunciation Lexicons. A pronunciation lexicon is a mapping between words (or short phrases), their written representations, and their pronunciations, suitable for use by an ASR engine or a TTS engine. Pronunciation lexicons are not only useful for voice browsers: they have also proven effective mechanisms to support accessibility for the differently abled as well as greater usability for all users, and they are used to good effect in screen readers and user agents supporting multimodal interfaces. The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is designed to enable interoperable specification of pronunciation lexicons. (http://www.w3.org/TR/pronunciation-lexicon/)
    32. PLS 1.0 – Language Overview. A PLS document is a container (<lexicon>) of several lexical entries (<lexeme>). Each lexical entry contains one or more spellings (<grapheme>) and one or more pronunciations (<phoneme>) or substitutions (<alias>). Each PLS document is related to a single unique language (xml:lang). SSML 1.0 and SRGS 1.0 documents can reference one or more PLS documents. The current version doesn’t include morphological, syntactic and semantic information associated with pronunciations. (http://www.w3.org/TR/pronunciation-lexicon/)
    33. PLS 1.0 – An Example (http://www.w3.org/TR/pronunciation-lexicon/)
        <?xml version="1.0" encoding="UTF-8"?>
        <lexicon version="1.0"
                 xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
                     http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
                 alphabet="ipa" xml:lang="en-US">
          <lexeme>
            <grapheme>Sepulveda</grapheme>
            <phoneme>səˈpʌlvɪdə</phoneme>
          </lexeme>
          <lexeme>
            <grapheme>W3C</grapheme>
            <alias>World Wide Web Consortium</alias>
          </lexeme>
        </lexicon>
    34. PLS 1.0 – Used for TTS (http://www.w3.org/TR/pronunciation-lexicon/)
        SSML 1.0:
            <?xml version="1.0" encoding="UTF-8"?>
            <speak version="1.0" … xml:lang="en-US">
              <lexicon uri="http://www.example.com/SSMLexample.pls"/>
              The title of the movie is: "La vita è bella" (Life is beautiful),
              which is directed by Benigni.
            </speak>
        PLS 1.0:
            <?xml version="1.0" encoding="UTF-8"?>
            <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
              <lexeme>
                <grapheme>La vita è bella</grapheme>
                <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
              </lexeme>
              <lexeme>
                <grapheme>Benigni</grapheme>
                <phoneme>bɛˈniːnji</phoneme>
              </lexeme>
            </lexicon>
    35. PLS 1.0 – Used for ASR (http://www.w3.org/TR/pronunciation-lexicon/)
        SRGS 1.0:
            <?xml version="1.0" encoding="UTF-8"?>
            <grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
              <lexicon uri="http://www.example.com/SRGSexample.pls"/>
              <rule id="movies" scope="public">
                <one-of>
                  <item>Terminator 2: Judgment Day</item>
                  <item>Pluto's Judgement Day</item>
                </one-of>
              </rule>
            </grammar>
        PLS 1.0:
            <?xml version="1.0" encoding="UTF-8"?>
            <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
              <lexeme>
                <grapheme>judgment</grapheme>
                <grapheme>judgement</grapheme>
                <phoneme>ˈdʒʌdʒ.mənt</phoneme>
              </lexeme>
            </lexicon>
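One way an engine might consume such a lexicon is sketched below in Python: the PLS document is parsed into a table keyed by grapheme, so that both spellings of "judgment" resolve to the same pronunciation. This is only a sketch under stated assumptions: PLS itself defines no matching algorithm, so the exact-string lookup, the function name `load_lexicon` and the table layout are illustrative choices, and the elided attributes of the slide example are simply left out.

```python
import xml.etree.ElementTree as ET

PLS_NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

LEXICON = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""


def load_lexicon(pls_xml):
    """Hypothetical helper: map each grapheme to its phonemes and aliases."""
    root = ET.fromstring(pls_xml.encode("utf-8"))
    table = {}
    for lexeme in root.findall(f"{PLS_NS}lexeme"):
        entry = {
            "phonemes": [p.text for p in lexeme.findall(f"{PLS_NS}phoneme")],
            "aliases": [a.text for a in lexeme.findall(f"{PLS_NS}alias")],
        }
        # Every spelling of the lexeme shares the same entry.
        for grapheme in lexeme.findall(f"{PLS_NS}grapheme"):
            table[grapheme.text] = entry
    return table


table = load_lexicon(LEXICON)
print(table["judgement"]["phonemes"])
print(table["W3C"]["aliases"])
```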
    36. Examples of Use: multiple pronunciations for the same orthography; multiple orthographies; homophones; homographs; acronyms, abbreviations, etc. Detailed descriptions can be found in: the W3C specification; Wikipedia; Paolo Baggia, SpeechTEK 2008 & Voice Search 2009.
    37. PLS 1.0 – Open Issues. No wide support of IPA in speech engines: changes are slowly under way, and the Phonetic Alphabet Registry will open doors to other alphabets in a controlled and interoperable way. Integration in ASR/TTS: SSML 1.1 will interoperate with PLS 1.0; SRGS 1.0 is still missing support of the role attribute for PLS 1.0. There is no matching algorithm inside PLS, because it is mainly a data format. (http://www.w3.org/TR/pronunciation-lexicon/)
    38. Pronunciation Alphabets: IPA, SAMPA
    39. International Phonetic Alphabet. Pronunciation is represented by a phonetic alphabet. Standard phonetic alphabets: the International Phonetic Alphabet (IPA); SAMPA, a well-known ASCII-based phonetic alphabet (simple to write); Pinyin (Mandarin Chinese), JEITA (Japanese), etc. There are also proprietary phonetic alphabets. The International Phonetic Alphabet (IPA): created by the International Phonetic Association (active since 1886) as a collaborative effort by all the major phoneticians around the world; a universally agreed system of notation for the sounds of languages; covers all languages; requires Unicode to write it; normatively referenced by PLS.
    40. IPA – Chart. The International Phonetic Association was founded in 1886; it is the major international association of phoneticians. The IPA alphabet provides symbols that make possible the phonemic transcription of all known languages. IPA characters can be encoded in Unicode by supplementing ASCII with characters from other ranges, particularly: IPA Extensions (0250–02AF) and Latin Extended-A (0100–017F). See the detailed charts: http://www.unicode.org/charts
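The two ranges named on this slide can be checked directly against a transcription. The short Python sketch below classifies each character of "səˈpʌlvɪdə" by code point; the function name `ipa_block` is hypothetical, and only the ranges quoted on the slide plus ASCII are tested, so anything else is reported as "other". Note that the primary stress mark ˈ (U+02C8) falls outside both ranges, in the Spacing Modifier Letters block.

```python
def ipa_block(ch):
    """Classify one character by the Unicode ranges quoted on the slide."""
    cp = ord(ch)
    if 0x0250 <= cp <= 0x02AF:
        return "IPA Extensions"
    if 0x0100 <= cp <= 0x017F:
        return "Latin Extended-A"
    if cp <= 0x007F:
        return "ASCII"
    return "other"


for ch in "səˈpʌlvɪdə":
    print(f"U+{ord(ch):04X} {ch} {ipa_block(ch)}")
```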
    41. Phonetic Alphabets – Issues. The real problem is how to write pronunciations reliably unless you are a trained phonetician. There are issues with fonts, authoring and browsers, but Unicode fonts today support IPA extensions, see: http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm There are very few tools to help with writing pronunciations and to let you listen to what you have written. Pronunciations should be made available in IPA or other general phonetic alphabets.
    42. Voice Dialog languages: VoiceXML 2.0, VoiceXML 2.1
    43. VoiceXML 2.0 – Features, Elements. Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>. Input: speech recognition, <grammar>; recording, <record>; keypad, <grammar mode="dtmf">. Output: audio files, <audio>; text-to-speech, <prompt>. Variables (ECMA-262): <var>, <assign>, <script>, scoping rules. Events: <nomatch>, <noinput>, <help>, <catch>, <throw>. Transition and submission: <goto>, <submit>. Telephony: connection control, <transfer>, <disconnect>; telephony information. Platform specifics: <object>. Performance: fetch, properties. (http://www.w3.org/TR/voicexml20/)
    44. VoiceXML 2.0 – Execution Model. Execution is synchronous: only the disconnect event is handled (somewhat) asynchronously. Execution is always in a single dialog, <form> or <menu>, and the Form Interpretation Algorithm performs <field> selection. Prompts are queued: they are played only when a waiting state is encountered, and before a fetchaudio is started. Processing is always in one of two states: waiting for input in an input item (<field>, <record>, <transfer>, etc.), or transitioning between input items in response to an input. Event-driven: user input event handling, <nomatch>, <noinput>; generalized event mechanism, <catch>, <throw>; call event handling, connection.*; error event handling, error.* (http://www.w3.org/TR/voicexml20/)
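The select/collect/process cycle described on this slide can be sketched as a small loop. The Python below is a heavily reduced illustration, not the Form Interpretation Algorithm of the specification: it has no guard conditions, grammars, event counters or prompt queueing, and the names `run_form`, `fields` and `answers` are hypothetical. A `None` answer stands in for a `<noinput>` event.

```python
def run_form(fields, answers):
    """Fill each field in document order; re-prompt a field after noinput."""
    values = dict.fromkeys(fields)   # all form item variables start undefined
    log = []
    pending_answers = list(answers)  # simulated user turns; the caller hangs up when empty
    while pending_answers:
        # Select: the first input item whose variable is still undefined.
        field = next((f for f in fields if values[f] is None), None)
        if field is None:
            break                    # every field is filled, the form is done
        # Collect: play the prompt, then wait for input.
        log.append(f"prompt:{field}")
        heard = pending_answers.pop(0)
        # Process: fill the variable or raise an event.
        if heard is None:
            log.append("event:noinput")
        else:
            values[field] = heard
    return values, log


values, log = run_form(["from", "to"], ["Torino", None, "Rome"])
print(values)
print(log)
```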
    45. VoiceXML 2.1 – Extended Features. Dynamically referencing grammars and scripts: <grammar srcexpr="…">, <script srcexpr="…">. Recording the user’s utterance during form filling: recordutterance property; new shadow variables recording, recordingsize, recordingduration. Detecting barge-in during prompt playback (SSML <mark>): new markexpr attribute; new shadow variables markname and marktime. Fetching XML data without a transition: <data>, with access through a read-only subset of DOM. Dynamically concatenating prompts: <foreach> iterates through ECMAScript arrays and executes its content. Sending data upon disconnect: <disconnect namelist="…">. Additional transfer type: <transfer type="consultation">. (http://www.w3.org/TR/voicexml21/)
    46. VoiceXML Applications. Static VoiceXML applications: the VoiceXML page is always the same, and so is the user experience; no personalization or customization. Dynamic VoiceXML applications: the user experience is customized, after authentication (PIN) or using the caller-id or SIP-id; data driven; dynamic pages are generated at runtime, e.g. with JSP, ASP, etc. (http://www.w3.org/TR/voicexml20/, http://www.w3.org/TR/voicexml21/)
    47. A Drawback of VoiceXML 2.0. The transition from one VoiceXML page to another is a costly activity: fetch the new page, if not cached; parse the page; initialize the context, possibly loading and initializing a new application root document; load or pre-compile scripts. Transitions are the only way to return data to the Web application (if the VoiceXML is dynamic), and pages must be created to include dynamic data. VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a running VoiceXML page. (http://www.w3.org/TR/voicexml20/, http://www.w3.org/TR/voicexml21/)
    48. Advantages of VoiceXML 2.1 - AJAX. Two of the eight new features in VoiceXML 2.1 help to create more dynamic VoiceXML applications: the <data> element and the <foreach> element. A static VoiceXML document can fetch user-specific data at runtime, without changing the VoiceXML document. The <data> element allows retrieval of arbitrary XML data without VoiceXML document transitions; the returned XML data are accessible through a subset of DOM primitives. The <foreach> element extends prompts to allow iteration over a dynamic array of information to create a dynamic prompt. This is similar to AJAX programming for HTML services: it decouples the presentation layer (VoiceXML) from the business logic (accessed via <data>). (http://www.w3.org/TR/voicexml21/)
    49. VoiceXML 2.1 – <data> Element. Attributes: name, the variable to be filled with the DOM of the retrieved data; src or srcexpr, the URI of the location of the XML data; namelist, the list of variables to be submitted; method, either ‘get’ or ‘post’; enctype, the media encoding; plus fetch and caching attributes. Like <var>, it may appear in executable content, in <form> and in <vxml>. The value of name must be a declared variable; the platform fills the variable with the DOM of the fetched XML data. The <data> element is synchronous (the service stops while it gets the data). (http://www.w3.org/TR/voicexml21/)
    50. VoiceXML 2.1 – <foreach> Element. Attributes: array, an ECMAScript expression that must evaluate to an ECMAScript array; item, the variable that stores the element being processed. <foreach> allows the application to iterate over an ECMAScript array and to execute its content. <foreach> may appear in executable content (all executable content elements may appear as content of <foreach>) and in <prompt> (restrictions on the content are applied). <foreach> allows sophisticated concatenation of prompts. (http://www.w3.org/TR/voicexml21/)
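The division of labor between `<data>` and `<foreach>` can be mimicked outside a voice browser. In the Python sketch below, parsing the XML stands in for what `<data>` does (filling a variable with a DOM), and the loop stands in for `<foreach>` concatenating prompt fragments. The flight data, its element and attribute names, and the function name `build_prompt` are all invented for illustration; no HTTP fetch is performed.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload that a <data src="..."> request might return.
FETCHED = """<flights>
  <flight city="Rome" time="9:40"/>
  <flight city="Paris" time="11:05"/>
</flights>"""


def build_prompt(xml_text):
    """Emulate <data> (parse into a DOM) followed by <foreach> (iterate and speak)."""
    dom = ET.fromstring(xml_text)                       # role of <data name="...">
    flights = [f.attrib for f in dom.findall("flight")]  # stand-in for an ECMAScript array
    parts = []
    for item in flights:                                # role of <foreach array="flights" item="item">
        parts.append(f"Flight to {item['city']} at {item['time']}.")
    return " ".join(parts)


print(build_prompt(FETCHED))
```

The VoiceXML page stays the same while the prompt follows whatever the data source returns, which is the decoupling the AJAX comparison points at.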
    51. VoiceXML – Final Remarks. The changed landscape for speech application development: virtually all IVRs today support VoiceXML. New options related to VoiceXML: SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie); large hosting of speech applications (TellMe, Voxeo); development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.). Further changes may come from CCXML adoption … but: mainly system-driven applications are currently deployed, and new challenges, such as incorporating more powerful dialog strategies and mixed initiative, are under discussion. (http://www.w3.org/TR/voicexml20/, http://www.w3.org/TR/voicexml21/)
    52. VoiceXML Resources. Voice Browser Working Group (spec, FAQ, implementations, resources): http://www.w3.org/Voice/ VoiceXML Forum site (resources, education, interest groups): http://www.voicexml.org/ VoiceXML Forum Review (interesting articles related to VoiceXML and more; example code in the sections "First Words" and "Speak & Listen"): http://www.voicexmlreview.org/ Ken Rehor’s World of VoiceXML: http://www.kenrehor.com/voicexml Online documentation related to VoiceXML platforms: Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie. Many books on VoiceXML: Jim Larson, "VoiceXML Introduction to Developing Speech Applications", Prentice-Hall, 2002; A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002.
    53. Call Control: CCXML 1.0
    54. CCXML 1.0 – Highlights: asynchronous event processing; acceptance or refusal of an incoming call; management of different types of call transfer; outbound call activation (interaction with an external entity); use of ECMAScript, adding scripting capabilities to call control applications; VoiceXML modularization; conferencing management.
    55. CCXML 1.0 – Elements Relationship
    56. CCXML 1.0 – Incoming Call. Event catching and processing: the CCXML Interpreter receives connection.alerting and the CCXML document handles it in a transition. (http://www.w3.org/TR/ccxml)
        CCXML document:
            <?xml version="1.0" encoding="UTF-8"?>
            <ccxml version="1.0">
              […]
              <transition event="connection.alerting">
                […]
              </transition>
              <transition event="connection.disconnected">
                […]
              </transition>
              …
        event$ content:
            name:'connection.alerting';
            connectionid:'0239023901903993';
            eventid:'00001';
            ....
    57. CCXML 1.0 – connection.alerting Event. Basic telephony information is retrieved on the alerting event and is available to the CCXML document: local URI, remote URI, protocol used, redirection info, etc. Based on this information, CCXML can accept or refuse the incoming call, even before contacting the dialog server. Any error that occurs during the phone call can be managed by the CCXML service (connection.failed, error.connection events). [Sequence diagram: the Call Control Adapter sends connection.alerting to the CCXML Interpreter, which analyzes the event$ content and answers with <accept/> or <reject/>; the VoiceXML Interpreter is not involved yet.] (http://www.w3.org/TR/ccxml)
    58. CCXML 1.0 – How to activate a new dialog. CCXML actions: receives the alerting event from the Call Control Adapter; asks the dialog server to prepare a new dialog; waits for the preparation; if the dialog has been successfully prepared, accepts the call; asks the dialog server to start the prepared dialog. [Sequence diagram between Call Control Adapter, CCXML Interpreter and VoiceXML Interpreter: alerting; prepare a new dialog; dialog prepared; call accepted; connected; start the prepared dialog; dialog started.]
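The sequence on this slide is essentially a small state machine, which the following Python sketch walks through. Event names such as `connection.alerting` and `dialog.prepared` and the action names follow CCXML usage, but the state names, the function `activate_dialog`, and the choice to reject the call when preparation fails are assumptions added for the example; the slide only describes the successful path.

```python
def activate_dialog(events):
    """Walk the alerting -> prepare -> accept -> start sequence; return (state, actions)."""
    actions = []
    state = "idle"
    for event in events:
        if state == "idle" and event == "connection.alerting":
            actions.append("dialogprepare")   # ask the dialog server to prepare a dialog
            state = "preparing"
        elif state == "preparing" and event == "dialog.prepared":
            actions.append("accept")          # the dialog is ready, so accept the call
            state = "accepting"
        elif state == "preparing" and event == "error.dialog.notprepared":
            actions.append("reject")          # assumption: refuse the call on failure
            state = "done"
        elif state == "accepting" and event == "connection.connected":
            actions.append("dialogstart")     # start the prepared dialog
            state = "starting"
        elif state == "starting" and event == "dialog.started":
            state = "running"
    return state, actions


state, actions = activate_dialog([
    "connection.alerting",
    "dialog.prepared",
    "connection.connected",
    "dialog.started",
])
print(state, actions)
```

Preparing the dialog before accepting means the caller is only connected once the VoiceXML side is known to be ready.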
    59. Call transfer. CCXML supports call transfer in different modes: "bridge", "blind", "consultation". Depending on the features of each mode, the CCXML language drives the expected interaction with the Call Control Adapter to perform the transfer correctly. During the different phases of a call transfer, CCXML can receive any asynchronous event and manage it correctly, interrupting the call if requested. [Sequence diagram between Call Control Adapter, CCXML Interpreter and VoiceXML Interpreter, performing a transfer: command1, answer1, […], transfer complete.]
    60. External Events. The CCXML Interpreter Context can receive events from an external entity able to use the HTTP protocol. Events generated in this way must be sent to CCXML with an HTTP POST command. Such an event can be addressed to a new session, whose creation must be requested, or to an existing session, by specifying its ID in the request. [Sequence diagram: the External Entity sends a basic HTTP event to the CCXML Interpreter; the interpreter performs event management and returns the event management result.] (http://www.w3.org/TR/ccxml)
    61. External event on a new session: the Outbound Call. A request arrives at Call Control from an external entity. The CCXML service associated with the received event is started, and a set of operations among the Call Control Adapter, Call Control and the Dialog Server is activated, so that the outbound call is placed. [Sequence diagram: outbound call request; Call Control Adapter, CCXML Interpreter, VoiceXML Interpreter; create a call → connection progressing → … → prepare a dialog → prepared → connection connected → start the prepared dialog]
    62. External event on a session: dialog termination request. An external entity sends an HTTP POST request to the CCXML Interpreter Context, specifying a session ID and requesting the termination of a particular dialog. CCXML checks the session ID and, if it is valid, the CCXML Interpreter injects the received event into the session. The CCXML service has a transition on that event and terminates the dialog with the given dialog identifier. [Sequence diagram: dialog termination request; Call Control Adapter, CCXML Interpreter, VoiceXML Interpreter; dialogterminate(dialogid) → dialog.exit; what follows depends on how the dialog.exit event is managed, e.g. disconnect(connId) or dialogprepare]
    63. Loading different CCXML documents: <fetch> and <goto> elements. <fetch> asynchronously fetches the content identified by its attributes; <goto> moves execution into a fetched document, if it was loaded successfully. Benefits: modularization, source exemplification, more readability. Example: <fetch next="'http://../Fetch/doc1.ccxml'" type="'application/ccxml+xml'" fetchid="result"/> fetches the document "doc1.ccxml" and results in fetch.done or error.fetch. The interpreter can then goto the new document, where the first event to occur is ccxml.loaded, or continue to work on the same dialog. http://www.w3.org/TR/ccxml
    64. Simple CCXML Document
        <?xml version="1.0" encoding="UTF-8"?>
        <ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
          <var name="currentState"/>
          <var name="myDialogId"/>
          <var name="myConnId"/>
          <eventprocessor statevariable="currentState">
            <transition event="connection.alerting">
              <assign name="myConnId" expr="event$.connectionid"/>
              <accept connectionid="event$.connectionid"/>
            </transition>
            <transition event="connection.connected">
              <dialogstart src="'http://www.example.com/flight.vxml'" connectionid="myConnId" dialogid="myDialogId"/>
            </transition>
            <transition event="dialog.started">
              <log expr="'VoiceXML appl is running now'"/>
            </transition>
            <transition event="connection.disconnected">
              <dialogterminate dialogid="myDialogId"/>
            </transition>
            <transition event="dialog.exit">
              <disconnect connectionid="myConnId"/>
            </transition>
            <transition event="*">
              <log expr="'Closing, unexpected:'+ event$.name"/>
              <exit/>
            </transition>
          </eventprocessor>
        </ccxml>
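One way to read a document like this is as a table from events to actions. The Python sketch below extracts that table with the standard library XML parser. It embeds an abbreviated copy of the slide's document; the helper name `transition_table` is an assumption for this example, and the code only inspects the markup, it does not execute CCXML.

```python
import xml.etree.ElementTree as ET

CCXML_NS = "{http://www.w3.org/2002/09/ccxml}"

# Abbreviated copy of the document on the slide.
DOC = """<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="myDialogId"/>
  <var name="myConnId"/>
  <eventprocessor>
    <transition event="connection.alerting">
      <assign name="myConnId" expr="event$.connectionid"/>
      <accept connectionid="event$.connectionid"/>
    </transition>
    <transition event="connection.connected">
      <dialogstart src="'http://www.example.com/flight.vxml'"
                   connectionid="myConnId" dialogid="myDialogId"/>
    </transition>
    <transition event="connection.disconnected">
      <dialogterminate dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.exit">
      <disconnect connectionid="myConnId"/>
    </transition>
  </eventprocessor>
</ccxml>"""


def transition_table(ccxml_text):
    """Map each event name to the list of action element names it triggers."""
    root = ET.fromstring(ccxml_text)
    table = {}
    for transition in root.iter(CCXML_NS + "transition"):
        # Strip the namespace prefix from each child tag to get the bare action name.
        actions = [child.tag[len(CCXML_NS):] for child in transition]
        table[transition.get("event")] = actions
    return table


TABLE = transition_table(DOC)
```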
    65. CCXML 1.0 – Next Steps. The CCXML specification is a Last Call Working Draft; all feature requests and clarifications have been addressed. An Implementation Report test suite is under development. The specification is very close to being published as a W3C Candidate Recommendation. Internal and external companies will be invited to submit implementation reports on their CCXML platforms. After that, the CCXML 1.0 specification can become a Proposed Recommendation and then a W3C Recommendation. http://www.w3.org/TR/ccxml
    66. Speech Interface Framework Tour Complete! Google TechTalk – Mar 6th, 2009 Paolo Baggia 66
    67. Speech Interface Framework – End of 2009 (by Jim Larson). [Diagram of the framework. Components: ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, Telephone System, User, World Wide Web. Specifications shown: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), Natural Language Semantics ML, EMMA 1.0, VoiceXML 2.0, VoiceXML 2.1, Pronunciation Lexicon Specification (PLS), Speech Synthesis Markup Language (SSML), Call Control XML (CCXML), Reusable Components]
    68. Architectural Changes. [Diagram: VoiceXML architecture. The User interacts with the VoiceXML platform, which comprises ASR/DTMF (.grxml/.gram, .pls), the VoiceXML Browser and TTS/Audio (.ssml, .wav/.mp3, .pls); the browser fetches .vxml documents over HTTP from the Web Application]
    69. VoxNauta – Internal Architecture Google TechTalk – Mar 6th, 2009 Paolo Baggia 69
    70. Loquendo MRCP Server/LSS 7.0 Architecture. [Diagram: Load Balancer; protocols RTSP (MRCPv1), SIP with SDP (MRCPv2) and RTP; RTSP, SIP, SDP, MRCP v1 and MRCP v2 parsers; MRCP v1/v2 Server Interface; management via SNMP, Graphic Management Console and configuration files; Logger with log files; Audio Provider API; NLSML / EMMA; TTS and ASR interface and APIs; engines LASR, LASR-SV and LTTS; Win32/Linux OS]
    71. IETF MRCP Protocols. The Media Resource Control Protocol (MRCP) is specified by the IETF. MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on RTSP/RTP. MRCPv2 is an Internet Draft, http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on SIP/RTP, which adds audio recording and speaker verification functionality. MRCP is a client-server solution optimized for the large-scale deployment of speech technologies in telephony, such as call centers, CRM, news and email reading, self-service applications, etc. It provides a standard interface to speech technologies across IVR platforms. For more information read: Dave Burke, Speech Processing for IP Networks: Media Resource Control Protocol (MRCP), Wiley.
    72. VoiceXML in a Call Center. [Diagram: Fixed/Mobile Network, PBX, optional Voice Gateway for non-SIP PBX, VoxNauta IVR, ACD, Web Server, CTI Server, Data Server, Operators]
    73. VoiceXML in the IMS Architecture. [Diagram: Fixed/Mobile Network, Voice Gateway (TDM protocols; SIP protocols and RTP), IP Network, VoxNauta MRF, Application Server (VoiceXML on HTTPS)]
    74. Overview A Bit of History W3C Speech Interaction Framework Today ASR/DTMF TTS Lexicons Voice Dialog and Call Control Voice Platforms and Next Evolutions W3C Multimodal Interaction Today MMI Architecture EMMA and InkML A language for Emotions Next Future
    75. Modes, Modalities and Technologies Speech Audio Stylus Touch Accelerometer Keyboard/keypad Mouse/touchpad Camera Geolocation Handwriting recognition Speaker verification Signature verification Fingerprint identification …. Google TechTalk – Mar 6th, 2009 Paolo Baggia 75
    76. Complement and Supplement. Speech vs. Visual: Transient vs. Persistent; Linear vs. Spatial; Hands- and Eyes-Free vs. Eyes; Suffers Noise vs. Suffers Light Conditions. Multimodality enables users to choose among different modalities or to mix them, and adapts to different social and environmental conditions or to user preference.
    77. GUI VUI MUI or MMUI Google TechTalk – Mar 6th, 2009 Paolo Baggia 77
    78. MMI has an Intrinsic Complexity. [Diagram: an Interaction Manager deriving user intent from many inputs: speech, text, mouse, handwriting, drawing, photograph, video, audio recording, geolocation, accelerometer, vital signs, sensor identification, fingerprint, face identification, speaker verification] Deborah Dahl, Voice Search 2009
    79. MMI can Include Many Different Technologies. [Diagram: Interaction Manager connected to touchscreen, accelerometer, speech recognition, geolocation, fingerprint recognition, keypad and handwriting recognition] Deborah Dahl, Voice Search 2009
    80. Uniform Representation for MMI. Getting everything to work together is complicated. One simplification is to represent the same information from different modalities in the same format. This requires a common language for representing the same information from different modalities: EMMA (Extensible MultiModal Annotation) 1.0, a uniform representation for multimodal information.
    81. [Diagram: touchscreen, accelerometer, speech recognition, geolocation, fingerprint recognition, keypad and handwriting recognition each send EMMA to the Interaction Manager] Deborah Dahl, Voice Search 2009
    82. EMMA Structural Elements. EMMA elements provide containers for application semantics and for multimodal annotation: emma:emma, emma:interpretation, emma:one-of, emma:group, emma:sequence, emma:lattice. Example: <emma:emma …> <emma:one-of> <emma:interpretation> … </emma:interpretation> <emma:interpretation> … </emma:interpretation> </emma:one-of> </emma:emma> http://www.w3.org/TR/emma/
    83. EMMA Annotations. Characteristics and processing of input, e.g.: emma:tokens: tokens of input; emma:process: reference to processing; emma:no-input: lack of input; emma:uninterpreted: uninterpretable input; emma:lang: human language of input; emma:signal: reference to signal; emma:media-type: media type; emma:confidence: confidence scores; emma:source: annotation of input source; emma:start, emma:end: timestamps (absolute/relative); emma:medium, emma:mode, emma:function: medium, mode, and function of input; emma:hook: hook. http://www.w3.org/TR/emma/
    84. EMMA 1.0 – Example Travel Application INPUT: \"I want to go from Boston to Denver on March 11\" http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009 Google TechTalk – Mar 6th, 2009 Paolo Baggia 84
    85. EMMA 1.0 – Same meaning
        Speech:
        <emma:interpretation medium="acoustic" mode="voice" id="int1">
          <origin>Boston</origin>
          <destination>Denver</destination>
          <date>11032009</date>
        </emma:interpretation>
        Mouse:
        <emma:interpretation medium="tactile" mode="gui" id="int1">
          <origin>Boston</origin>
          <destination>Denver</destination>
          <date>11032009</date>
        </emma:interpretation>
        http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
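The point of the slide, that an application can ignore which modality produced an input, can be sketched in Python. The two inputs below copy the slide's simplified markup, with an `xmlns:emma` declaration added so that each fragment parses on its own. The helper names `slots` and `modality` are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# The slide's simplified markup, with the EMMA namespace declared
# so that each fragment is well-formed by itself.
SPEECH = """<emma:interpretation xmlns:emma="http://www.w3.org/2003/04/emma"
    medium="acoustic" mode="voice" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>"""

MOUSE = """<emma:interpretation xmlns:emma="http://www.w3.org/2003/04/emma"
    medium="tactile" mode="gui" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>"""


def slots(interpretation_text):
    """Return the application semantics as a dict of slot name to value."""
    root = ET.fromstring(interpretation_text)
    return {child.tag: child.text for child in root}


def modality(interpretation_text):
    """Return the (medium, mode) annotation of the input."""
    root = ET.fromstring(interpretation_text)
    return root.get("medium"), root.get("mode")


SAME_MEANING = slots(SPEECH) == slots(MOUSE)
```

The application logic can work on `slots(...)` alone and consult `modality(...)` only when the input channel matters, for example to choose how to confirm the request.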
    86. EMMA 1.0 – Handwriting Input
        <emma:interpretation medium="tactile" mode="ink" id="int1">
          <origin>Boston</origin>
          <destination>Denver</destination>
          <date>11032009</date>
        </emma:interpretation>
        http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
    87. EMMA 1.0 – Biometrics Input
        Identification from a photograph:
        <emma:emma version="1.0">
          <emma:interpretation id="int1"
              emma:confidence=".75"
              emma:medium="visual"
              emma:mode="photograph"
              emma:verbal="false"
              emma:function="identification">
            <person>12345</person>
            <name>Mary Smith</name>
          </emma:interpretation>
        </emma:emma>
        Identification from voice:
        <emma:emma version="1.0">
          <emma:interpretation id="int1"
              emma:confidence=".80"
              emma:medium="acoustic"
              emma:mode="voice"
              emma:verbal="false"
              emma:function="identification">
            <person>12345</person>
            <name>Mary Smith</name>
          </emma:interpretation>
        </emma:emma>
        http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
    88. EMMA 1.0 – Representing Lattices. Speech recognizers, handwriting recognizers and other input processing components may provide lattice output: a graph encoding a range of possible recognition results or interpretations. [Diagram: lattice with nodes 1 to 8 and arcs flights (1→2), to (2→3), boston | austin (3→4), from (4→5), portland | oakland (5→6), today (6→7), please (7→8), tomorrow (6→8)] From Michael Johnston, AT&T Research. http://www.w3.org/TR/emma/
    89. EMMA 1.0 – Representing Lattices. Lattices can be represented using EMMA elements: <emma:lattice emma:initial="?" emma:final="?"> and <emma:arc emma:from="?" emma:to="?">
        <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
          <emma:interpretation>
            <emma:lattice emma:initial="1" emma:final="8">
              <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
              <emma:arc emma:from="2" emma:to="3">to</emma:arc>
              <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
              <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
              <emma:arc emma:from="4" emma:to="5">from</emma:arc>
              <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
              <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
              <emma:arc emma:from="6" emma:to="7">today</emma:arc>
              <emma:arc emma:from="7" emma:to="8">please</emma:arc>
              <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
            </emma:lattice>
          </emma:interpretation>
        </emma:emma>
        From Michael Johnston, AT&T Research. http://www.w3.org/TR/emma/
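To make the lattice concrete, the following Python sketch expands the markup above into the word sequences it encodes. It assumes a single final node and an acyclic lattice, as in the slide's example; the function name `lattice_paths` is an assumption for this example.

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"

LATTICE = """<emma:lattice xmlns:emma="http://www.w3.org/2003/04/emma"
    emma:initial="1" emma:final="8">
  <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
  <emma:arc emma:from="2" emma:to="3">to</emma:arc>
  <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
  <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
  <emma:arc emma:from="4" emma:to="5">from</emma:arc>
  <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
  <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
  <emma:arc emma:from="6" emma:to="7">today</emma:arc>
  <emma:arc emma:from="7" emma:to="8">please</emma:arc>
  <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
</emma:lattice>"""


def lattice_paths(lattice_text):
    """Enumerate every word sequence from the initial to the final node.

    Assumes one final node and no cycles.
    """
    root = ET.fromstring(lattice_text)
    initial = root.get(EMMA + "initial")
    final = root.get(EMMA + "final")

    # Adjacency list: node -> [(next node, word on the arc), ...]
    arcs = {}
    for arc in root.iter(EMMA + "arc"):
        source = arc.get(EMMA + "from")
        arcs.setdefault(source, []).append((arc.get(EMMA + "to"), arc.text))

    def walk(node, words):
        if node == final:
            yield " ".join(words)
            return
        for target, word in arcs.get(node, []):
            yield from walk(target, words + [word])

    return list(walk(initial, []))


PATHS = lattice_paths(LATTICE)
```

Ten arcs encode eight candidate sentences here (two cities, two origins, two endings), which is why a lattice is more compact than an N-best list.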
    90. EMMA in Multimodal Framework http://www.w3.org/TR/mmi-framework EMMA Google TechTalk – Mar 6th, 2009 Paolo Baggia 90
    91. InkML 1.0 – Digital Ink Ink Markup Language (InkML), http://www.w3.org/TR/InkML Data format for presenting digital Ink (pen, stylus, etc) Allows the input and processing of handwritings, gesture, sketches, music, etc. <ink> <trace> 10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126, 10 140, 13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147, 49 135, 58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191, 93 205 </trace> <trace> 130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125, 152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200, 150 208, 163 210, 178 208, 192 201, 205 192, 214 180 </trace> <trace> 227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134, 230 148, 234 162, 235 176, 238 190, 241 204 </trace> <trace> 282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129, 291 143, 294 157, 294 171, 294 185, 296 199, 300 213 </trace> <trace> 366 130, 359 143, 354 157, 349 171, 352 185, 359 197, 371 204, 385 205, 398 202, 408 191, 413 177, 413 163, 405 150, 392 143, 378 141, 365 150 </trace> </ink> http://www.w3.org/TR/InkML/ Google TechTalk – Mar 6th, 2009 Paolo Baggia 91
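A consumer of such a document first has to turn each trace into coordinates. The Python sketch below does this for the simple form used on the slide, where every point is an explicit "x y" pair and points are separated by commas. InkML also allows other encodings within a trace, which this sketch does not handle. The sample data is the start of two traces from the slide, and the names `parse_trace`, `parse_ink` and `bounding_box` are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# First points of two traces from the slide; the <ink> element is
# written without a namespace, as it appears there.
INK = """<ink>
  <trace>10 0, 9 14, 8 28, 7 42</trace>
  <trace>227 50, 226 64, 225 78</trace>
</ink>"""


def parse_trace(trace_text):
    """Parse 'x y, x y, ...' into a list of (x, y) float pairs."""
    points = []
    for pair in trace_text.split(","):
        x, y = pair.split()
        points.append((float(x), float(y)))
    return points


def parse_ink(ink_text):
    """Return one list of points per <trace> element."""
    root = ET.fromstring(ink_text)
    return [parse_trace(trace.text) for trace in root.iter("trace")]


def bounding_box(traces):
    """Return (min_x, min_y, max_x, max_y) over all traces."""
    xs = [x for trace in traces for x, _ in trace]
    ys = [y for trace in traces for _, y in trace]
    return min(xs), min(ys), max(xs), max(ys)


TRACES = parse_ink(INK)
BOX = bounding_box(TRACES)
```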
    92. InkML 1.0 – Status and Advances Rich annotation for Ink: Trace, Trace formats and Trace collections Contextual information Canvases Etc. Result of classification of InkML traces may be a semantic representation in EMMA 1.0 Current status is Last Call Working Draft, next will be Candidate Recommendation with release of an Impl. Report test-suite Raising interest from major industries http://www.w3.org/TR/InkML/ Google TechTalk – Mar 6th, 2009 Paolo Baggia 92
    93. MMI Architecture Specification. "Multimodal Architecture and Interfaces", W3C Working Draft, http://www.w3.org/TR/mmi-arch/. The Runtime Framework provides the basic infrastructure and controls communication among the constituents. The Interaction Manager (IM) coordinates Modality Components (MCs) by life-cycle events and contains the shared data (context). Communication between IM and MCs is event-based. [Diagram: Runtime Framework containing the Delivery Context Component, the Interaction Manager and the Data Component; Modality Component API; Modality Component 1 to Modality Component N] Ingmar Kliche, SpeechTEK 2008
    94. MMI Arch – Laboratory Implementation. Implementation of components using W3C markup languages. [Diagram: Runtime Framework containing the Delivery Context Component, the Interaction Manager (SCXML) and the Data Component; Modality Component API; Modality Component 1 (HTML, for GUI) and Modality Component N (VoiceXML, for VUI)] http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
    95. MMI Arch – Laboratory Implementation. SCXML-based Interaction Manager; VoiceXML + HTML modality components. [Diagram: server with SCXML interpreter and HTTP I/O Processor; GUI modality component: HTML browser on the client, Modality Component API over HTTP + XML (using AJAX); voice modality component: CCXML/VoiceXML browser on a server with telephony interface to the phone client, Modality Component API over HTTP + XML (EMMA)] http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
    96. MMI Architecture – Open Issues Profiles Start-up, Registration, Delegation in distributed environment Transport of Events Extensibility of Events http://www.w3.org/TR/mmi-arch/ Google TechTalk – Mar 6th, 2009 Paolo Baggia 96
    97. Emotion in Wikipedia From Wikipedia definition: “An emotion is a mental and physiological state associated with a wide variety of feelings, thoughts, and behaviours. It is a prime determinant of the sense of subjective well-being and appears to play a central role in many human activities. As a result of this generality, the subject has been explored in many, if not all of the human sciences and art forms. There is much controversy concerning how emotions are defined and classified.” General goal: Make interaction between humans and machines more natural for the humans Machines should become able: • to register human emotions (and related states) • to convey emotions (and related states) • to “understand” the emotional relevance of events Google TechTalk – Mar 6th, 2009 Paolo Baggia 97
    98. Emotional States are Numerous. [Figure: a circular map of several dozen emotion words, from "bellicose" and "enraged" to "serene" and "sleepy", arranged along the axes Active/Passive, Positive/Negative, High/Low Power/Control and Obstructive/Conducive, with Angry, Happy and Sad marked as reference regions. Source: Scherer et al., Univ. Geneva]
    99. HUMAINE project HUMAINE project European Network of Excellence Activity: 01/2004 - 12/2007 33 partner institutions from many disciplines Today: HUMAINE Association (since June 2007) 125 members Web-site: http://emotion-research.net Google TechTalk – Mar 6th, 2009 Paolo Baggia 99
    100. Online Speaker Classification. Classification techniques: Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA): preprocessing step to reduce feature vector dimension; Support Vector Machines (SVM): use "kernel functions" to separate non-linear decision boundaries; Classification and Regression Trees (CART); K-nearest Neighbor; Hidden Markov Models (HMMs): used to model temporal structure; Gaussian Mixture Models: model training data as Gaussian densities; Artificial Neural Networks (ANN), e.g. MLP: interesting training algorithms. Felix Burkhardt, Colloquium Hochschule Zittau/Görlitz, 4.8.2008.
    101. Expressive TTS – Two Approaches. 1. Different speech databases, one for each expressive style: an effective solution, feasible only for a very limited range of emotions. [Diagram: text + expressive tags → Waveform Selection over the databases for style 1, style 2, …, style n] 2. Speech signal manipulation according to style-dependent prosodic models: a flexible solution, but it requires accurate models and effective signal processing capabilities. [Diagram: text + expressive tags → Prosodic Model; Waveform Selection from a neutral style → Signal Processing] From Enrico Zovato, Loquendo
    102. Expressive TTS – Example Prosodic Patterns. Synthesis of two basic emotional styles through prosodic modification: different intonation contours and different acoustic unit durations. [Plot: frequency (Hz, 0 to 500) over time (s, 0 to 1.8) for the POS ("happy") and NEG ("sad") styles; audio examples POS and NEG for Male-UK and Female-UK voices] From Enrico Zovato, Loquendo
    103. Emotions in ECAs From Piero Cosi, CNR, Padova Google TechTalk – Mar 6th, 2009 Paolo Baggia 103
    104. W3C Emotion Incubator “The W3C Incubator Activity fosters rapid development, on a time scale of a year or less, of new Web-related concepts. Target concepts include innovative ideas for specifications, guidelines, and applications that are not (or not yet) clear candidates as Web standards developed through the more thorough process afforded by the W3C Recommendation Track.” W3C Emotion Incubator Aims: First Charter XG (2006-2007): “...to investigate the prospects of defining a general-purpose Emotion annotation and representation language...” “...which should be usable in a large variety of technological contexts where emotions need to be represented.” Second Charter XG (Nov. 2007 – Nov. 2008): Prioritize the requirements; Release a first specification draft; Illustrate how to combine the Emotion Markup Language with existing markup languages. Google TechTalk – Mar 6th, 2009 Paolo Baggia 104
    105. W3C Emotion Incubator – Members. Chairman: Marc Schröder, DFKI. W3C Members: DFKI, Loquendo, Deutsche Telekom, SRI International, NTUA, Fraunhofer, Chinese Acad. Science. Invited Experts: Emotion AI, Univ. Paris 8, Univ. Basque Country, Univ. C. Cork, OFAI (Austria), IPCA (Portugal), Tech. Univ. Munich. Web space: http://www.w3.org/2005/Incubator/emotion. Results: use case description document; requirements document; Final Report (20 Nov 2008): Elements of an EmotionML 1.0, http://www.w3.org/2005/Incubator/emotion/XGR-emotionml/
    106. W3C Emotion Incubator – EmotionML 1.0. Document structure: container element (<emotionml>), single emotion annotation (<emotion>). Representation of emotions: <category> element, <dimensions> element, <appraisals> element, <action-tendency> element, <intensity> element. Meta information: confidence attribute, <modality> element, <metadata> element. Links and time: <link> element, <timing> element. Scale values: value attribute, <traces> element.
    107. EmotionML 1.0 – Examples
        Expression of emotions in SSML 1.1:
        <?xml version="1.0"?>
        <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
               xmlns:emo="http://www.w3.org/2008/11/emotionml" xml:lang="en-US">
          <s>
            <emo:emotion>
              <emo:category set="everydayEmotions" name="doubt"/>
              <emo:intensity value="0.4"/>
            </emo:emotion>
            Do you need help?
          </s>
        </speak>
        Detection of emotions in EMMA 1.0:
        <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
                   xmlns="http://www.w3.org/2008/11/emotionml">
          <emma:interpretation start="12457990" end="12457995" mode="voice" verbal="false">
            <emotion>
              <intensity value="0.1" confidence="0.8"/>
              <category set="everydayEmotions" name="boredom" confidence="0.1"/>
            </emotion>
          </emma:interpretation>
        </emma:emma>
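A dialog application receiving the EMMA result above would need to pull out the detected category and its confidence before acting on it. The Python sketch below shows one way, using the namespaces and attribute names exactly as they appear on the slide, which reflect the incubator draft and may differ in later EmotionML versions. The function name `detected_emotions` is an assumption for this example.

```python
import xml.etree.ElementTree as ET

EMO = "{http://www.w3.org/2008/11/emotionml}"

# The detection example from the slide.
RESULT = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.w3.org/2008/11/emotionml">
  <emma:interpretation start="12457990" end="12457995" mode="voice" verbal="false">
    <emotion>
      <intensity value="0.1" confidence="0.8"/>
      <category set="everydayEmotions" name="boredom" confidence="0.1"/>
    </emotion>
  </emma:interpretation>
</emma:emma>"""


def detected_emotions(emma_text):
    """Return one dict per <emotion> annotation found in the EMMA document."""
    root = ET.fromstring(emma_text)
    results = []
    for emotion in root.iter(EMO + "emotion"):
        category = emotion.find(EMO + "category")
        intensity = emotion.find(EMO + "intensity")
        results.append({
            "category": category.get("name"),
            "category_confidence": float(category.get("confidence")),
            "intensity": float(intensity.get("value")),
        })
    return results


EMOTIONS = detected_emotions(RESULT)
```

With a category confidence of only 0.1, an application would probably treat this detection as too weak to change its behavior.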
    108. Overview A Bit of History W3C Speech Interaction Framework Today ASR/DTMF TTS Lexicons Voice Dialog and Call Control Voice Platforms and Next Evolutions W3C Multimodal Interaction Today MMI Architecture EMMA and InkML A language for Emotions Next Future
    109. W3C VBWG/MMIWG – Next Future Spec for the next generation of Voice Browsing SCXML 1.0 VoiceXML 3.0 Google TechTalk – Mar 6th, 2009 Paolo Baggia 109
    110. State Charts – SCXML. State Chart XML (SCXML): http://www.w3.org/TR/2008/WD-scxml-20080516/. A powerful state-machine language, based on David Harel's State Charts (see his book) and adopted in UML. Standard under development by the W3C VBWG, http://www.w3.org/TR/scxml/. States, Transitions, Events. Data model: extends the basic finite state automaton; conditions on transitions. Nested States: represent task decomposition; in multiple dependent states at the same time. Parallel States: represent fork/join logic. Wide interest: VBWG, MMI WG, other W3C groups, universities, industries. Open source implementations are already available.
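The idea behind nested states can be sketched in a few lines of Python: a child state that does not handle an event defers to its parent, so common behavior such as hanging up is declared once. This is a simplified illustration, not SCXML semantics: it omits entry and exit actions, conditions, the data model and parallel states, and all state and event names are assumptions for this example.

```python
class State:
    """A state with optional parent and a map of event name to target state."""

    def __init__(self, name, parent=None, transitions=None):
        self.name = name
        self.parent = parent
        self.transitions = transitions or {}


def next_state(states, current, event):
    """Find the target for an event, searching the state and then its ancestors."""
    state = states[current]
    while state is not None:
        if event in state.transitions:
            return state.transitions[event]
        state = state.parent
    return current  # no state in the chain handles the event: stay put


# "in_call" is a parent state: its "hangup" transition applies to both children.
IN_CALL = State("in_call", transitions={"hangup": "idle"})
STATES = {
    "idle": State("idle", transitions={"call": "prompting"}),
    "in_call": IN_CALL,
    "prompting": State("prompting", parent=IN_CALL,
                       transitions={"prompt.done": "listening"}),
    "listening": State("listening", parent=IN_CALL,
                       transitions={"input": "prompting"}),
}
```

For example, `next_state(STATES, "listening", "hangup")` resolves through the parent and yields `"idle"`, although `listening` itself declares no such transition.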
    111. SCXML 1.0 – Parallel State Charts Google TechTalk – Mar 6th, 2009 Paolo Baggia 111
    112. SCXML as MMI Interaction Manager. [Diagram: SCXML Interaction Manager coordinating the Voice Modality, the Gesture Modality and the Visual Modality]
    113. SCXML for VoiceXML 3.0. [Diagram: SCXML Interaction Manager coordinating the Voice Modality, the Gesture Modality and the Visual Modality]
    114. SCXML 1.0 – Open Issues. Data model: ECMAScript (ECMA-262) or other formats? Definition of Profiles. Other.
    115. Re-Thinking VoiceXML – VoiceXML 3.0 Well-founded: From syntactic description to a semantic model Extensible: SIV, EMMA support, rich media, VCR control, etc. Profiled: light profile (mobile?), media profile (scalability), VoiceXML 2.1 profile (interoperability), etc. Flexibility: Customization of FIA (Form Interpretation Algorithm) Google TechTalk – Mar 6th, 2009 Paolo Baggia 115
    116. VoiceXML 3.0 – Separation of Concerns SCXML 1.0 Application and interaction logic VoiceXML 3.0: Voice Interaction only, under control of SCXML VoiceXML 3.0 has been published as a First Working Draft, http://www.w3.org/TR/2008/WD-voicexml30-20081219/ Send public comments Google TechTalk – Mar 6th, 2009 Paolo Baggia 116
    117. THANK YOU for clarifications or questions: paolo.baggia@loquendo.com Google TechTalk – Mar 6th, 2009 Paolo Baggia 117
    118. THANK YOU. For more information, keep an eye on www.loquendo.com. Contact: paolo.baggia@loquendo.com. Keep in touch with Loquendo news: subscribe to the Loquendo Newsletter. Try our interactive TTS demo: insert your text, choose a language, and listen. The latest news at a click. Consult the Loquendo Newsletter online. Keep up to date on events and initiatives. For further information, fill in our Contacts Form. Loquendo S.p.A., 745 Fifth Ave, 27th Floor, New York, NY 10151, USA, Tel. +1 212.310.9075, Fax +1 212.310.9001, www.loquendo.com. Loquendo S.p.A., Via Olivetti, 6, 10148 Torino, Italy, Tel. +39 011 291 3111, Fax +39 011 291 3199, www.loquendo.com.

    GoogleTecTalks, 6 months ago


    © All Rights Reserved
