1. Voice Browser and Multimodal Interaction In 2009
   Paolo Baggia, Director of International Standards
   March 6th, 2009 – Google TechTalk

2. Overview
   - A Bit of History
   - W3C Speech Interaction Framework Today: ASR/DTMF, TTS, Lexicons, Voice Dialog and Call Control, Voice Platforms and Next Evolutions
   - W3C Multimodal Interaction Today: MMI Architecture, EMMA and InkML, A Language for Emotions
   - Next Future

3. Company Profile
   - Privately held company (fully owned by Telecom Italia), founded in 2001 as a spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and expertise in voice processing.
   - Global company, leader in Europe and South America for award-winning, high-quality voice technologies (synthesis, recognition, authentication and identification), available in 26 languages and 62 voices.
   - Multilingual, proprietary technologies protected by over 100 patents worldwide.
   - Financially robust: break-even reached in 2004, revenues and earnings growing year on year.
   - Growth-plan investment approved for the evolution of products and services.
   - Headquarters in Torino; local representative sales offices in Rome, New York, Madrid, Paris, London, Munich.
   - Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.

4. International Awards
   - “2008 Frost & Sullivan European Telematics and Infotainment Emerging Company of the Year” Award
   - Winner of “Market Leader – Best Speech Engine”, Speech Industry Award 2007 and 2008
   - Loquendo MRCP Server: winner of the 2008 IP Contact Center Technology Pioneer Award
   - “Best Innovation in Automotive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2007
   - “Best Innovation in Expressive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2006
   - “Best Innovation in Multi-Lingual Speech Synthesis” Prize, AVIOS-SpeechTEK West 2005

5. A Bit of History

6. Standard Bodies
   Two main standard bodies:
   - W3C – World Wide Web Consortium: founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full potential. Staff based at MIT (USA), ERCIM (France) and Keio Univ. (Japan); 400 members all over the world; 50 Working, Interest and Coordination Groups. W3C is where the framework of today's Web is developed (HTML, CSS, XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web Accessibility, Device Independence).
   - IETF – Internet Engineering Task Force: founded in 1986, and grew from 1991 as the Internet Society; 1300 members. HTTP, SIP, RTP and many other protocols. The Media Resource Control Protocol (MRCP) is very relevant for speech platforms.
   Two industrial forums:
   - VoiceXML Forum (www.voicexml.org): inventors of VoiceXML 1.0, then submitted to W3C for standardization. Its current goal is to promote, disseminate and support VoiceXML and related standards.
   - SALT Forum (www.saltforum.org): supported by Microsoft to define a lightweight markup for telephony and multimodal applications.
   Other relevant bodies: 3GPP, OMA, ETSI, NIST.

7. The (r)evolution of VoiceXML, 1998–2009
   - 1998: W3C Voice Browser Workshop
   - 1999: VoiceXML Forum birth (by AT&T, IBM, Lucent, Motorola); W3C charters the Voice Browser WG
   - 2000: VoiceXML 1.0 released
   - 2002: SALT Forum birth (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks); W3C charters the Multimodal Interaction WG
   - 2004: VoiceXML 2.0, SRGS 1.0 and SSML 1.0 become W3C Recommendations
   - 2007: SISR 1.0 W3C Recommendation
   - 2008: PLS 1.0 W3C Recommendation
   - 2009: EMMA 1.0 W3C Recommendation
   [Photo: preparing to announce VoiceXML 1.0, Friday Feb. 25th, 2000, Lucent, Naperville, Illinois. Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent), Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola), Dave Ladd (Motorola).]

8. Speech Interface Framework in 2000 (by Jim Larson)
   [Diagram: the W3C Speech Interface Framework. The user connects over the telephone system to ASR and a DTMF tone recognizer on the input side (annotated with the Speech Recognition Grammar Spec. SRGS, SISR, the N-gram Grammar ML and the Natural Language Semantics ML, feeding EMMA into language understanding and context interpretation), and to TTS and a pre-recorded audio player on the output side (annotated with SSML and the Pronunciation Lexicon Specification PLS, fed by media planning and language generation). A dialog manager (VoiceXML 2.0/2.1, reusable components, CCXML for call control) ties everything to the World Wide Web.]

9. Speech Interface Framework – Today (by Jim Larson)
   [The same framework diagram, updated to reflect the specifications' status today (EMMA is now EMMA 1.0).]

10. Speech Interface Framework – End of 2009 (by Jim Larson)
   [The same framework diagram, showing the status expected for the specifications by the end of 2009.]

11. W3C Process

12. Architectural Changes
   - Traditional (proprietary) architecture: the user talks to a proprietary platform (ASR/DTMF, TTS/Audio) running a speech application built with a proprietary SCE.
   - VoiceXML architecture: the user talks to a VoiceXML platform (ASR/DTMF, TTS/Audio) running a VoiceXML Browser, which fetches the application from a web server over HTTP, exchanging .vxml pages, .grxml/.gram grammars, .pls lexicons, .ssml prompts and .wav/.mp3 audio.

13. The VoiceXML Impact
   VoiceXML changed the landscape of IVRs and speech application creation: from proprietary to standard-based speech applications.
   Before:
   - Proprietary platforms (HW & SW)
   - Proprietary technologies
   - Applications built with proprietary SCEs
   - Mainly DTMF and pre-recorded prompts
   - First attempts to add speech into IVR
   After:
   - Standard VoiceXML platforms
   - Standards for speech applications
   - Standard tools for VoiceXML applications
   - Integration of DTMF and ASR
   - Still a predominance of DTMF, but more and more speech applications

14. Overview
   Agenda recap: next, the W3C Speech Interaction Framework today.

15. Standards for ASR and DTMF: SRGS 1.0, SISR 1.0

16. W3C Standards for Speech/DTMF Grammars
   - SYNTAX – SRGS: defines constraints on the sentences admissible for a specific recognition turn; two forms (ABNF and XML), covering both voice and DTMF. http://www.w3.org/TR/speech-grammar/
   - SEMANTICS – SISR: describes how to produce results after an utterance is recognized; two flavors (literal and script), again for voice and DTMF. http://www.w3.org/TR/semantic-interpretation/

17. SRGS/SISR Grammars for “Torino”

   SRGS XML, literal semantics (tag-format "semantics/1.0-literals"):

   <?xml version="1.0" encoding="UTF-8"?>
   <grammar xml:lang="en-US" version="1.0"
     xmlns="http://www.w3.org/2001/06/grammar"
     tag-format="semantics/1.0-literals">
     <rule id="main" scope="public">
       <token>Torino</token>
       <tag>10100</tag>
     </rule>
   </grammar>

   SRGS ABNF, literal semantics:

   #ABNF 1.0 iso-8859-1;
   mode voice;
   tag-format <semantics/1.0-literals>;
   public $main = Torino {10100} ;

   SRGS XML, script semantics (tag-format "semantics/1.0"):

   <?xml version="1.0" encoding="UTF-8"?>
   <grammar xml:lang="en-US" version="1.0"
     xmlns="http://www.w3.org/2001/06/grammar"
     tag-format="semantics/1.0">
     <tag>var unused=7;</tag>
     <rule id="main" scope="public">
       <token>Torino</token>
       <tag>out="10100";</tag>
     </rule>
   </grammar>

   SRGS ABNF, script semantics:

   #ABNF 1.0 iso-8859-1;
   mode voice;
   tag-format <semantics/1.0>;
   {var unused=7;};
   public $main = Torino {out="10100";} ;

18. SRGS/SISR Standards – Pros
   - Powerful syntax (CFG) and very powerful semantics (ECMAScript)
   - DTMF and voice input are transparent to the application
   - Wide and consistent adoption among technology vendors
   - The two syntaxes, XML and ABNF, are great: developers can choose (XML validation vs. compact format)
   - Transformations are possible: XML to ABNF (easy, a simple XSLT); ABNF to XML (requires an ABNF parser)
   - Open-source tools might be created to: validate grammar syntax, transform grammars, debug grammars on written input, run coverage tests (explode covered sentences, GenSem, SemTester, etc.)

19. SRGS/SISR Standards – Small Issues
   - Semantics declaration, the tag-format attribute:
     - Value "semantics/1.0"? Mandates SISR script semantics inside semantic tags.
     - Value "semantics/1.0-literals"? Mandates SISR literal semantics inside semantic tags.
     - Missing? Unclear! Risk of interoperability troubles.
   - SISR script semantics:
     - Clumsy default assignment: returns the last referenced rule only; the developer must properly pop results up.
     - Be careful when redefining "out": assigning a scalar value might result in errors (see the sketch below).
   - SISR literal semantics:
     - Only useful for very simple word-list rules.
     - No support for encapsulating rules: SISR literal grammars can be used as external references ONLY.

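   To make the script-semantics caveats concrete, a minimal sketch (a hypothetical city grammar, not one of the deck's examples) that pops a subrule result up into a property of "out" explicitly, instead of relying on the clumsy default assignment:

   <?xml version="1.0" encoding="UTF-8"?>
   <grammar xml:lang="en-US" version="1.0"
     xmlns="http://www.w3.org/2001/06/grammar"
     root="order" tag-format="semantics/1.0">
     <rule id="order" scope="public">
       fly to <ruleref uri="#city"/>
       <!-- Copy the subrule result into a property of "out" explicitly;
            "out" stays an object, it is never redefined to a scalar. -->
       <tag>out.destination = rules.city;</tag>
     </rule>
     <rule id="city">
       <one-of>
         <item>Torino <tag>out = "10100";</tag></item>
         <item>Milano <tag>out = "20100";</tag></item>
       </one-of>
     </rule>
   </grammar>
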
20. SRGS/SISR – Encapsulated Grammars
   [Diagram: a script grammar (Gr1.grxml) referencing, as external rules, both literal grammars (Gr2.gram, Gr41.grxml) and script grammars (Gr3.grxml, Gr42.gram).]

21. SRGS/SISR Standards – Rich XML Results
   Section 7 of the SISR 1.0 specification (http://www.w3.org/TR/semantic-interpretation/#SI7) defines serialization rules from SISR ECMAScript results into XML. Edge cases: arrays, the special variables "_attributes" and "_value", and the creation of namespaces and prefixes.

   ECMAScript result:

   {
     drink: {
       _nsdecl: { _prefix:"n1", _name:"http://www.example.com/n1" },
       _nsprefix:"n1",
       liquid: {
         _nsdecl: { _prefix:"n2", _name:"http://www.example.com/n2" },
         _attributes: { color: { _nsprefix:"n2", _value:"black" } },
         _value:"coke"
       },
       size:"medium"
     }
   }

   XML serialization:

   <n1:drink xmlns:n1="http://www.example.com/n1">
     <liquid n2:color="black"
       xmlns:n2="http://www.example.com/n2">coke</liquid>
     <size>medium</size>
   </n1:drink>

22. SRGS/SISR Standards – Next Steps
   - Adoption of the PLS 1.0 lexicon: the <token> element is a clear entry point into PLS lexicons, but a role attribute on <token> is missing to allow homograph disambiguation
   - Next extensions via Errata: XML 1.1 support and an Implementation Report; update normative references
   - No major extensions are needed!

23. Speech Synthesis: SSML 1.0/1.1

24. TTS – Functional Architecture and Markup/Non-Markup Support
   Text-to-speech pipeline: Structure Analysis, Text Normalization, Text-to-Phoneme Conversion, Prosody Analysis, Waveform Production.
   - Structure analysis: markup support <p>, <s>; non-markup support: infer the structure by automatic text analysis
   - Text normalization: markup support <say-as> for dates, times, phone numbers, numbers, and <sub> for acronyms and transliterations; non-markup support: automatically identify and convert constructs
   - Text-to-phoneme conversion: markup support <phoneme>, <lexicon>; non-markup support: look up in a pronunciation dictionary
   - Prosody analysis: markup support <emphasis>, <break>, <prosody>; non-markup support: automatically generate prosody through analysis of document structure and sentence syntax
   - Waveform production: markup support <voice>, <audio>
   http://www.w3.org/TR/speech-synthesis/

25. SSML 1.0 – Language Description (I)
   Document structure: the <speak> root element, with a version attribute and the SSML namespace attribute.

   <?xml version="1.0" encoding="ISO-8859-1"?>
   <speak version="1.0"
     xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
     <p>I don't speak Japanese.</p>
     <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
   </speak>

   Processing and pronunciation:
   - <p> and <s> (paragraph and sentence) give a structure to the text
   - <say-as> indicates the type of text construct contained within the element, e.g. dates, numbers, etc.
   - <phoneme> provides a phonetic pronunciation, in IPA, for the contained text
   - <sub> provides substitutions, e.g. for expanding acronyms into a sequence of words
   http://www.w3.org/TR/speech-synthesis/

26. SSML 1.0 – Language Description (II)
   Style: the <voice> element. Voice selection attributes are: name, xml:lang, gender, age, and variant.

   <?xml version="1.0" encoding="ISO-8859-1"?>
   <speak version="1.0"
     xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
     The moon is rising on the beach, when John says, looking Mary
     in the eyes: <voice name="simon">I love you!</voice>
     but she suddenly replies:
     <voice name="susan">Please, be serious!</voice>
   </speak>

   - <emphasis> requests that the contained text be spoken with emphasis; its level attribute can be set to strong, moderate, reduced, or none
   - <break> controls the pausing between words: the time attribute takes time expressions ("5s", "20ms"); the strength attribute takes none, x-weak, weak, medium (the default), strong, or x-strong
   http://www.w3.org/TR/speech-synthesis/

27. SSML 1.0 – Language Description (III)
   Prosody: the <prosody> element permits control of the pitch, speaking rate and volume of the speech output. Its attributes are:
   - volume: the volume for the contained text
   - rate: the speaking rate, in words per minute, for the contained text
   - duration: a value in seconds or milliseconds for the desired time to take to read the element contents
   - pitch: the baseline pitch for the contained text
   - range: the pitch range (variability) for the contained text, in Hertz
   - contour: sets the actual pitch contour for the contained text
   Other elements:
   - <audio> plays an audio file
   - <mark> places a marker into the text/tag sequence
   - <desc> provides a description of a non-speech audio source in <audio>
   http://www.w3.org/TR/speech-synthesis/

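   Pulling slides 25–27 together, a minimal sketch (a hypothetical flight announcement, not from the deck) combining <say-as>, <break> and <prosody>:

   <?xml version="1.0" encoding="UTF-8"?>
   <speak version="1.0"
     xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
     <p>
       <s>Your flight leaves on
         <say-as interpret-as="date" format="mdy">3/6/2009</say-as>.
       </s>
       <break strength="medium"/>
       <s>
         <!-- Slow down and raise the volume for the key information. -->
         <prosody rate="slow" volume="loud">Gate B 12.</prosody>
       </s>
     </p>
   </speak>
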
28. Towards SSML 1.1 – Motivations
   Internationalization needs. Three workshops: Beijing (Nov '05), Crete (May '06), Hyderabad (Jan '07). Results:
   - No major needs for Eastern and Western European languages
   - Many issues for Far East languages (Mandarin, Japanese, Korean)
   - Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many Indian languages: mark input with or without vowels; mark the transliteration schema used for input
   Extensions required by the Voice Browser group:
   - More powerful error handling, selection of fall-back strategies
   - Trimming attributes
   - The volume attribute adopts a logarithmic scale (before it was linear)
   - Alignment with the PLS 1.0 specification for user lexicons
   http://www.w3.org/TR/speech-synthesis11/

29. SSML 1.1 – Language Changes
   - <w> element
   - Lexicon extensions: the <lookup> element applies a specific referenced lexicon to the contained text (sketched below)
   - Phonetic Alphabet Registry creation and adoption:
     - "ipa" for the International Phonetic Alphabet
     - A registering policy for other phonetic alphabets, similar to LTRU for language tags
     - Candidates: Pinyin for Mandarin Chinese, JEITA for Japanese, X-SAMPA (an ASCII transliteration of IPA codes)
   http://www.w3.org/TR/speech-synthesis11/

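   A minimal sketch of the lexicon extension, assuming the <lexicon xml:id>/<lookup ref> linkage of the SSML 1.1 drafts and a hypothetical lexicon URI:

   <?xml version="1.0" encoding="UTF-8"?>
   <speak version="1.1"
     xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
     <!-- Load a lexicon and name it, so it can be activated selectively. -->
     <lexicon uri="http://www.example.com/movies.pls" xml:id="movies"/>
     The title of the movie is:
     <lookup ref="movies">La vita è bella</lookup>.
   </speak>
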
30. Pronunciation Lexicon: PLS 1.0

31. Pronunciation Lexicons
   - A pronunciation lexicon is a mapping between words (or short phrases), their written representations, and their pronunciations, suitable for use by an ASR engine or a TTS engine.
   - Pronunciation lexicons are not only useful for voice browsers: they have also proven an effective mechanism to support accessibility for the differently abled, as well as greater usability for all users. They are used to good effect in screen readers and in user agents supporting multimodal interfaces.
   - The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is designed to enable interoperable specification of pronunciation lexicons.
   http://www.w3.org/TR/pronunciation-lexicon/

32. PLS 1.0 – Language Overview
   - A PLS document is a container (<lexicon>) of several lexical entries (<lexeme>)
   - Each lexical entry contains one or more spellings (<grapheme>) and one or more pronunciations (<phoneme>) or substitutions (<alias>)
   - Each PLS document is related to a single unique language (xml:lang)
   - SSML 1.0 and SRGS 1.0 documents can reference one or more PLS documents
   - The current version doesn't include morphological, syntactic or semantic information associated with pronunciations
   http://www.w3.org/TR/pronunciation-lexicon/

33. PLS 1.0 – An Example

   <?xml version="1.0" encoding="UTF-8"?>
   <lexicon version="1.0"
     xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
       http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
     alphabet="ipa" xml:lang="en-US">
     <lexeme>
       <grapheme>Sepulveda</grapheme>
       <phoneme>səˈpʌlvɪdə</phoneme>
     </lexeme>
     <lexeme>
       <grapheme>W3C</grapheme>
       <alias>World Wide Web Consortium</alias>
     </lexeme>
   </lexicon>

   http://www.w3.org/TR/pronunciation-lexicon/

34. PLS 1.0 – Used for TTS

   SSML 1.0:

   <?xml version="1.0" encoding="UTF-8"?>
   <speak version="1.0" … xml:lang="en-US">
     <lexicon uri="http://www.example.com/SSMLexample.pls"/>
     The title of the movie is: "La vita è bella" (Life is beautiful),
     which is directed by Benigni.
   </speak>

   PLS 1.0:

   <?xml version="1.0" encoding="UTF-8"?>
   <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
     <lexeme>
       <grapheme>La vita è bella</grapheme>
       <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
     </lexeme>
     <lexeme>
       <grapheme>Benigni</grapheme>
       <phoneme>bɛˈniːnji</phoneme>
     </lexeme>
   </lexicon>

   http://www.w3.org/TR/pronunciation-lexicon/

35. PLS 1.0 – Used for ASR

   SRGS 1.0:

   <?xml version="1.0" encoding="UTF-8"?>
   <grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
     <lexicon uri="http://www.example.com/SRGSexample.pls"/>
     <rule id="movies" scope="public">
       <one-of>
         <item>Terminator 2: Judgment Day</item>
         <item>Pluto's Judgement Day</item>
       </one-of>
     </rule>
   </grammar>

   PLS 1.0:

   <?xml version="1.0" encoding="UTF-8"?>
   <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
     <lexeme>
       <grapheme>judgment</grapheme>
       <grapheme>judgement</grapheme>
       <phoneme>ˈdʒʌdʒ.mənt</phoneme>
     </lexeme>
   </lexicon>

   http://www.w3.org/TR/pronunciation-lexicon/

36. Examples of Use
   - Multiple pronunciations for the same orthography
   - Multiple orthographies
   - Homophones
   - Homographs
   - Acronyms, abbreviations, etc.
   Detailed descriptions can be found in: the W3C specification, Wikipedia, and Paolo Baggia's talks at SpeechTEK 2008 & Voice Search 2009.

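   Two of these cases in a minimal sketch (hypothetical entries, not from the deck): multiple pronunciations for one orthography, and an acronym expanded via <alias>:

   <?xml version="1.0" encoding="UTF-8"?>
   <lexicon version="1.0"
     xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
     alphabet="ipa" xml:lang="en-US">
     <!-- Multiple pronunciations for the same orthography. -->
     <lexeme>
       <grapheme>tomato</grapheme>
       <phoneme>təˈmeɪtoʊ</phoneme>
       <phoneme>təˈmɑːtoʊ</phoneme>
     </lexeme>
     <!-- An acronym expanded through <alias>. -->
     <lexeme>
       <grapheme>IVR</grapheme>
       <alias>interactive voice response</alias>
     </lexeme>
   </lexicon>
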
37. PLS 1.0 – Open Issues
   - No wide support of IPA in speech engines; changes are slowly under way, and the Phonetic Alphabet Registry will open the door to other alphabets in a controlled and interoperable way
   - Integration in ASR/TTS: SSML 1.1 will interoperate with PLS 1.0; SRGS 1.0 is still missing support for a role attribute for PLS 1.0
   - No matching algorithm inside PLS, because it is mainly a data format
   http://www.w3.org/TR/pronunciation-lexicon/

38. Pronunciation Alphabets: IPA, SAMPA

39. International Phonetic Alphabet
   Pronunciation is represented by a phonetic alphabet.
   - Standard phonetic alphabets:
     - International Phonetic Alphabet (IPA), the best-known phonetic alphabet
     - SAMPA, ASCII based (simple to write)
     - Pinyin (Chinese Mandarin), JEITA (Japanese), etc.
   - Proprietary phonetic alphabets
   The International Phonetic Alphabet (IPA):
   - Created by the International Phonetic Association (active since 1886), a collaborative effort by all the major phoneticians around the world
   - A universally agreed system of notation for the sounds of languages
   - Covers all languages
   - Requires Unicode to write it
   - Normatively referenced by PLS

40. IPA – Chart
   - The IPA was founded in 1886; it is the major international association of phoneticians
   - The IPA alphabet provides symbols making possible the phonemic transcription of all known languages
   - IPA characters can be encoded in Unicode by supplementing ASCII with characters from other ranges, particularly: IPA Extensions (0250–02AF) and Latin Extended-A (0100–017F)
   - See the detailed charts: http://www.unicode.org/charts

41. Phonetic Alphabets – Issues
   - The real problem is how to write pronunciations reliably, unless you are a trained phonetician
   - There are issues with fonts, authoring and browsers, but Unicode fonts today support the IPA extensions, see: http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm
   - There are very few tools to help write pronunciations and to let you listen to what you have written
   - Goal: make pronunciations available in IPA or other general phonetic alphabets.

42. Voice Dialog Languages: VoiceXML 2.0, VoiceXML 2.1

43. VoiceXML 2.0 – Features, Elements
   - Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>
   - Input: speech recognition <grammar>; recording <record>; keypad <grammar mode="dtmf">
   - Output: audio files <audio>; text-to-speech <prompt>
   - Variables (ECMA-262): <var>, <assign>, <script>, scoping rules
   - Events: <nomatch>, <noinput>, <help>, <catch>, <throw>
   - Transition and submission: <goto>, <submit>
   - Telephony: connection control <transfer>, <disconnect>; telephony information
   - Platform specifics: <object>
   - Performance: fetch, properties
   http://www.w3.org/TR/voicexml20/

44. VoiceXML 2.0 – Execution Model
   - Execution is synchronous; only the disconnect event is handled (somewhat) asynchronously
   - Execution is always in a single dialog, <form> or <menu>; the Form Interpretation Algorithm drives <field> selection
   - Prompts are queued: played only when a waiting state is reached, and played before a fetchaudio is started
   - Processing is always in one of two states: waiting for input in an input item (<field>, <record>, <transfer>, etc.), or transitioning between input items in response to an input
   - Event-driven: user input event handling (<nomatch>, <noinput>), a generalized event mechanism (<catch>, <throw>), call event handling (connection.*), error event handling (error.*)
   http://www.w3.org/TR/voicexml20/

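   These mechanics in a minimal sketch (a hypothetical one-field service; the grammar and submit URLs are placeholders):

   <?xml version="1.0" encoding="UTF-8"?>
   <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
     <form id="city">
       <field name="destination">
         <prompt>Which city do you want to fly to?</prompt>
         <!-- Hypothetical external SRGS grammar. -->
         <grammar src="http://www.example.com/cities.grxml"
                  type="application/srgs+xml"/>
         <noinput>Sorry, I didn't hear you. <reprompt/></noinput>
         <nomatch>Sorry, I didn't understand. <reprompt/></nomatch>
         <filled>
           <!-- Transition back to the web application with the result. -->
           <submit next="http://www.example.com/book" namelist="destination"/>
         </filled>
       </field>
     </form>
   </vxml>
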
45. VoiceXML 2.1 – Extended Features
   - Dynamically referencing grammars and scripts: <grammar expr="…">, <script expr="…">
   - Record the user's utterance during form filling: the recordutterance property; new shadow variables recording, recordingsize, recordingduration
   - Detect barge-in during prompt playback (SSML <mark>): the markexpr attribute; new shadow variables markname and marktime
   - Fetch XML data without a transition: <data>, using a read-only subset of DOM
   - Dynamically concatenate prompts: <foreach>, to iterate through ECMAScript arrays and execute content
   - Send data upon disconnect: <disconnect namelist="…">
   - An additional transfer type: <transfer type="consultation">
   http://www.w3.org/TR/voicexml21/

46. VoiceXML Applications
   - Static VoiceXML applications: the VoiceXML page is always the same, and so is the user experience; no personalization or customization
   - Dynamic VoiceXML applications: the user experience is customized (after authentication with a PIN, or using the caller-id or SIP-id); data driven; dynamic pages generated at runtime, e.g. with JSP, ASP, etc.
   http://www.w3.org/TR/voicexml20/  http://www.w3.org/TR/voicexml21/

47. A Drawback of VoiceXML 2.0
   A drawback of VoiceXML is that the transition from one VoiceXML page to another is a costly activity:
   - Fetch the new page, if not cached
   - Parse the page
   - Initialize the context, possibly loading and initializing a new application root document
   - Load or pre-compile scripts
   Transitions are the only way to return data to the web application (if the VoiceXML is dynamic), and pages must be created to include dynamic data. VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a running VoiceXML page.
   http://www.w3.org/TR/voicexml20/  http://www.w3.org/TR/voicexml21/

48. Advantages of VoiceXML 2.1 – AJAX
   Two of the eight new features in VoiceXML 2.1 help to create more dynamic VoiceXML applications: the <data> element and the <foreach> element.
   - A static VoiceXML document can fetch user-specific data at runtime, without changing the VoiceXML document
   - <data> allows retrieval of arbitrary XML data without VoiceXML document transitions; the returned XML data are accessible through a subset of DOM primitives
   - <foreach> extends prompts to allow iteration over a dynamic array of information to create a dynamic prompt
   This is similar to AJAX programming for HTML services: it decouples the presentation layer (VoiceXML) from the business logic (accessed via <data>).
   http://www.w3.org/TR/voicexml21/

49. VoiceXML 2.1 – <data> Element
   Attributes:
   - name: the variable to be filled with the DOM of the retrieved data
   - src or srcexpr: the URI of the location of the XML data
   - namelist: the list of variables to be submitted
   - method: either 'get' or 'post'
   - enctype: the media encoding
   - fetch and caching attributes
   Like <var>, it may appear in executable content and in <form> and <vxml>. The value of name must be a declared variable; the platform fills that variable with the DOM of the fetched XML data. The <data> element is synchronous (the service stops to get the data).
   http://www.w3.org/TR/voicexml21/

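   A small sketch of <data> in use (hypothetical URL; assuming the server returns something like <account><balance>42</balance></account>, navigated with the read-only DOM subset):

   <?xml version="1.0" encoding="UTF-8"?>
   <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
     <form id="balance">
       <var name="account" expr="'12345'"/>
       <!-- Fills "info" with the DOM of the fetched XML; no page transition. -->
       <data name="info" src="http://www.example.com/balance.xml"
             namelist="account" method="get"/>
       <block>
         <prompt>
           Your balance is
           <value expr="info.documentElement.firstChild.firstChild.nodeValue"/>
           euros.
         </prompt>
       </block>
     </form>
   </vxml>
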
50. VoiceXML 2.1 – <foreach> Element
   Attributes:
   - array: an ECMAScript expression that must evaluate to an ECMAScript array
   - item: the variable that stores the element being processed
   <foreach> allows the application to iterate over an ECMAScript array and to execute its content. It may appear:
   - In executable content (all executable content elements may appear as content of <foreach>)
   - In <prompt> (restrictions on the content apply)
   <foreach> allows sophisticated concatenation of prompts.
   http://www.w3.org/TR/voicexml21/

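   And a companion sketch of <foreach> building a prompt from an array (here filled by a <script> for brevity; in practice it could come from <data>):

   <?xml version="1.0" encoding="UTF-8"?>
   <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
     <form id="list">
       <script>
         var flights = [ {city:'Boston', time:'9 a m'},
                         {city:'Denver', time:'11 a m'} ];
       </script>
       <block>
         <prompt>
           The available flights are:
           <!-- Iterate over the array, speaking one item per element. -->
           <foreach item="f" array="flights">
             <value expr="f.city"/> at <value expr="f.time"/>,
           </foreach>
         </prompt>
       </block>
     </form>
   </vxml>
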
51. VoiceXML – Final Remarks
   The changed landscape for speech application development:
   - Virtually all IVRs today support VoiceXML
   - New options related to VoiceXML: SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie); large-scale hosting of speech applications (TellMe, Voxeo); development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
   - Further changes may come from CCXML adoption
   … but:
   - Mainly system-driven applications are actually deployed
   - New challenges to incorporate more powerful dialog strategies and mixed initiative are under discussion.
   http://www.w3.org/TR/voicexml20/  http://www.w3.org/TR/voicexml21/

52. VoiceXML Resources
   - Voice Browser Working Group (spec, FAQ, implementations, resources): http://www.w3.org/Voice/
   - VoiceXML Forum site (resources, education, interest groups): http://www.voicexml.org/
   - VoiceXML Forum Review (http://www.voicexmlreview.org/): interesting articles related to VoiceXML and more; example code in the sections "First Words" and "Speak & Listen"
   - Ken Rehor's World of VoiceXML: http://www.kenrehor.com/voicexml
   - Online documentation related to VoiceXML platforms: Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie
   - Many books on VoiceXML: Jim Larson, "VoiceXML: Introduction to Developing Speech Applications", Prentice-Hall, 2002; A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002

53. Call Control: CCXML 1.0

54. CCXML 1.0 – Highlights
   - Asynchronous event processing
   - Acceptance or refusal of an incoming call
   - Different types of call transfer management
   - Outbound call activation (interaction with an external entity)
   - Use of ECMAScript, adding scripting capabilities to call control applications
   - VoiceXML modularization
   - Conferencing management

55. CCXML 1.0 – Elements Relationship

56. CCXML 1.0 – Incoming Call
   Event catching and processing in a CCXML document:

   <?xml version="1.0" encoding="UTF-8"?>
   <ccxml version="1.0">
     […]
     <transition event="connection.alerting">
       […]
     </transition>
     <transition event="connection.disconnected">
       […]
     </transition>
     …

   The CCXML Interpreter receives the event object, e.g.:
   event$ = { name:'connection.alerting';
              connectionid:'0239023901903993'; eventid:'00001'; … }
   http://www.w3.org/TR/ccxml

57. CCXML 1.0 – connection.alerting Event
   - Basic telephony information has been retrieved on the alerting event and is available to the CCXML document: local URI, remote URI, protocol used, redirection info, etc.
   - Based on the checked information, CCXML can accept or refuse the incoming call (<accept/> | <reject/>), even before contacting the dialog server
   - Any error that occurs during the phone call can be managed by the CCXML service (connection.failed, error.connection events)
   [Sequence: the Call Control Adapter sends connection.alerting to the CCXML Interpreter, which analyzes the event$ content and replies with <accept/> or <reject/>.]
   http://www.w3.org/TR/ccxml

58. CCXML 1.0 – How to Activate a New Dialog
   CCXML actions:
   - Receives the alerting event from the Call Control Adapter
   - Asks the dialog server to prepare a new dialog
   - Waits for the preparation
   - If the dialog has been successfully prepared, accepts the call
   - Asks the dialog server to start the prepared dialog
   [Sequence: alerting → prepare a new dialog → dialog prepared → call accepted / connected → start the prepared dialog → dialog started.]

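   As CCXML markup, those steps might look like this minimal sketch (a fragment to sit inside an <eventprocessor>; the dialog URL is hypothetical):

   <transition event="connection.alerting">
     <!-- Prepare the VoiceXML dialog before accepting the call. -->
     <dialogprepare src="'http://www.example.com/flight.vxml'"
                    dialogid="myDialogId"/>
   </transition>
   <transition event="dialog.prepared">
     <accept/>
   </transition>
   <transition event="connection.connected">
     <!-- Start the dialog that was prepared above. -->
     <dialogstart prepareddialogid="myDialogId"
                  connectionid="event$.connectionid"/>
   </transition>
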
59. Call Transfer
   - CCXML supports call transfer in different modalities: "bridge", "blind", "consultation"
   - Based on the features of each modality, the CCXML language allows the expected interaction with the Call Control Adapter to correctly perform the transfer
   - During the different phases of the transfer, CCXML can receive any asynchronous event and correctly manage it, interrupting the call if requested
   [Sequence: the CCXML Interpreter exchanges commands and answers with the Call Control Adapter until the transfer completes.]

60. External Events
   - The CCXML Interpreter Context can receive events from an external entity able to use the HTTP protocol
   - Events generated in this way must be sent to the CCXML Interpreter by an HTTP POST command
   - An event delivered this way can be addressed: to a new session, whose creation must be requested; or to an existing session, specifying its ID in the request
   [Sequence: External Entity → basic HTTP event → CCXML Interpreter → event management → result.]
   http://www.w3.org/TR/ccxml

61. External Event on a New Session: the Outbound Call
   - A particular request arrives at Call Control from an external entity
   - The CCXML service associated with the received event is started, and a set of operations between the Call Control Adapter, Call Control and the Dialog Server is activated: the outbound call is placed
   [Sequence: outbound call request → create a call → connection progressing → prepare a dialog → prepared → connection connected → start the prepared dialog.]

62. External Event on a Session: Dialog Termination Request
   - An external entity performs an HTTP POST request towards the CCXML Interpreter Context, specifying a sessionid and requesting the termination of a particular dialog
   - The CCXML Interpreter checks the session id; if it is valid, it injects the received event into the session
   - The CCXML service has a transition on that event and performs the dialog termination on a particular dialog identifier; what follows (e.g. <disconnect> on dialog.exit) depends on the dialog.exit event management

63. Loading Different CCXML Documents: <fetch> and <goto> Elements
   - <fetch> and <goto> are used, respectively, to asynchronously fetch the content identified by the attributes of <fetch>, and to go to a fetched document once it is successfully loaded
   - Benefits: modularization, source exemplification, more readability

   <fetch next="'http://../Fetch/doc1.ccxml'"
          type="'application/ccxml+xml'" fetchid="result"/>

   - Fetching the document "doc1.ccxml" results in fetch.done or error.fetch
   - The first event that occurs in a newly entered document is ccxml.loaded; <goto> moves into the new document, or the interpreter continues to work on the same dialog
   http://www.w3.org/TR/ccxml

64. Simple CCXML Document

   <?xml version="1.0" encoding="UTF-8"?>
   <ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
     <var name="currentState"/>
     <var name="myDialogId"/>
     <var name="myConnId"/>
     <eventprocessor statevariable="currentState">
       <transition event="connection.alerting">
         <assign name="myConnId" expr="event$.connectionid"/>
         <accept connectionid="event$.connectionid"/>
       </transition>
       <transition event="connection.connected">
         <dialogstart src="'http://www.example.com/flight.vxml'"
           connectionid="myConnId" dialogid="myDialogId"/>
       </transition>
       <transition event="dialog.started">
         <log expr="'VoiceXML appl is running now'"/>
       </transition>
       <transition event="connection.disconnected">
         <dialogterminate dialogid="myDialogId"/>
       </transition>
       <transition event="dialog.exit">
         <disconnect connectionid="myConnId"/>
       </transition>
       <transition event="*">
         <log expr="'Closing, unexpected:' + event$.name"/>
         <exit/>
       </transition>
     </eventprocessor>
   </ccxml>

65. CCXML 1.0 – Next Steps
   - The CCXML specification is a Last Call Working Draft; all feature requests and clarifications have been addressed
   - An Implementation Report test suite is under development
   - It is very close to being published as a W3C Candidate Recommendation
   - Companies, inside or outside W3C, will be invited to send implementation reports on their CCXML platforms
   - After that, the CCXML 1.0 specification will be able to become a Proposed Recommendation and then a W3C Recommendation.
   http://www.w3.org/TR/ccxml

66. Speech Interface Framework – Tour Complete!

67. Speech Interface Framework – End of 2009 (by Jim Larson)
   [The framework diagram once more, with the full set of specifications (SRGS, SISR, SSML, PLS, VoiceXML 2.0/2.1, CCXML, EMMA 1.0) expected by the end of 2009.]

68. Architectural Changes
   [The VoiceXML architecture again: the user talks to the VoiceXML platform (ASR/DTMF, TTS/Audio) running a VoiceXML Browser, which exchanges .vxml, .grxml/.gram, .pls, .ssml and .wav/.mp3 resources with the web application over HTTP.]

69. VoxNauta – Internal Architecture

70. Loquendo MRCP Server/LSS 7.0 Architecture
   [Diagram: a load balancer in front of the MRCP server; RTSP (MRCPv1) and SIP/SDP (MRCPv2) signaling with RTP audio; RTSP, SDP, MRCPv1 and MRCPv2 parsers; management (SNMP) with a graphic configuration console and config files; a logger with log files; an audio provider API on Win32/Linux; NLSML/EMMA results; and the TTS & ASR interface/API to the LTTS, LASR and LASR-SV engines.]

71. IETF MRCP Protocols
   - The Media Resource Control Protocols (MRCP) are IETF standards:
     - MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on RTSP/RTP
     - MRCPv2 is an Internet Draft, http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on SIP/RTP, offering new audio recording and speaker verification functionalities
   - An optimized client-server solution for the large-scale deployment of speech technologies in the telephony field, such as call centers, CRM, news and email reading, self-service applications, etc.
   - Allows a standard interface to speech technologies in all IVR platforms
   For more information read: Dave Burke, "Speech Processing for IP Networks: Media Resource Control Protocol (MRCP)", Wiley

72. VoiceXML in a Call Center
   [Diagram: calls from the fixed/mobile network reach a PBX (with an optional voice gateway for non-SIP PBXs); the VoxNauta IVR talks to the web server and data; the ACD routes calls to operators supported by a CTI server.]

73. VoiceXML in the IMS Architecture
   [Diagram: the fixed/mobile network connects via TDM protocols to a voice gateway; SIP signaling and RTP audio flow over the IP network to VoxNauta acting as an MRF, which fetches VoiceXML over HTTPS from the application server.]

74. Overview
   Agenda recap: next, W3C Multimodal Interaction today (MMI Architecture, EMMA and InkML, a language for emotions) and the near future.

75. Modes, Modalities and Technologies
   Speech, audio, stylus, touch, accelerometer, keyboard/keypad, mouse/touchpad, camera, geolocation, handwriting recognition, speaker verification, signature verification, fingerprint identification, …

76. Complement and Supplement
   - Speech: transient, linear, hands- and eyes-free, suffers from noise
   - Visual: persistent, spatial, requires the eyes, suffers from light conditions
   Multimodality enables choosing among different modalities or mixing them, adapting to different social or environmental conditions and to user preference.

77. GUI + VUI = MUI (or MMUI)

78. MMI Has an Intrinsic Complexity
   [Diagram (Deborah Dahl, Voice Search 2009): an Interaction Manager mediating many inputs, grouped by function: sensors (geolocation, accelerometer, vital signs), identification (fingerprint, face identification, speaker verification), user intent (speech, text, mouse, handwriting, drawing), and recording (video, photograph, audio recording).]

79. MMI Can Include Many Different Technologies
   [Diagram (Deborah Dahl, Voice Search 2009): an Interaction Manager connected to touchscreen, accelerometer, geolocation, speech recognition, keypad, fingerprint recognition and handwriting recognition.]

80. Uniform Representation for MMI
   - Getting everything to work together is complicated. One simplification is to represent the same information from different modalities in the same format.
   - Hence the need for a common language for representing the same information from different modalities: EMMA (Extensible MultiModal Annotation) 1.0, a uniform representation for multimodal information.

81. [Diagram (Deborah Dahl, Voice Search 2009): the same modality components — touchscreen, accelerometer, geolocation, speech recognition, keypad, fingerprint recognition, handwriting recognition — now all feeding EMMA documents into the Interaction Manager.]

82. EMMA Structural Elements
   EMMA elements provide containers for application semantics and for multimodal annotation:
   - emma:emma
   - emma:interpretation
   - emma:one-of
   - emma:group
   - emma:sequence
   - emma:lattice

   <emma:emma …>
     <emma:one-of>
       <emma:interpretation> … </emma:interpretation>
       <emma:interpretation> … </emma:interpretation>
     </emma:one-of>
   </emma:emma>

   http://www.w3.org/TR/emma/

83. EMMA Annotations
   Characteristics and processing of the input, e.g.:
   - emma:tokens – tokens of the input
   - emma:process – reference to processing
   - emma:no-input – lack of input
   - emma:uninterpreted – uninterpretable input
   - emma:lang – human language of the input
   - emma:signal – reference to the signal
   - emma:media-type – media type
   - emma:confidence – confidence scores
   - emma:source – annotation of the input source
   - emma:start, emma:end – timestamps (absolute/relative)
   - emma:medium, emma:mode, emma:function – medium, mode, and function of the input
   - emma:hook – hook
   http://www.w3.org/TR/emma/

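   Combining the containers and annotations, a minimal sketch of a two-hypothesis speech result (hypothetical values, not from the deck):

   <emma:emma version="1.0"
     xmlns:emma="http://www.w3.org/2003/04/emma">
     <!-- An N-best list: two competing interpretations of one utterance. -->
     <emma:one-of id="nbest"
       emma:medium="acoustic" emma:mode="voice"
       emma:start="1236340800000" emma:end="1236340802000">
       <emma:interpretation id="int1" emma:confidence="0.85"
         emma:tokens="flights to boston">
         <destination>Boston</destination>
       </emma:interpretation>
       <emma:interpretation id="int2" emma:confidence="0.65"
         emma:tokens="flights to austin">
         <destination>Austin</destination>
       </emma:interpretation>
     </emma:one-of>
   </emma:emma>
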
84. EMMA 1.0 – Example Travel Application
   INPUT: "I want to go from Boston to Denver on March 11"
   (Deborah Dahl, Voice Search 2009)  http://www.w3.org/TR/emma/

85. EMMA 1.0 – Same Meaning

   Speech:

   <emma:interpretation medium="acoustic" mode="voice"
     id="int1">
     <origin>Boston</origin>
     <destination>Denver</destination>
     <date>11032009</date>
   </emma:interpretation>

   Mouse:

   <emma:interpretation medium="tactile" mode="gui"
     id="int1">
     <origin>Boston</origin>
     <destination>Denver</destination>
     <date>11032009</date>
   </emma:interpretation>

   (Deborah Dahl, Voice Search 2009)  http://www.w3.org/TR/emma/

86. EMMA 1.0 – Handwriting Input

    <emma:interpretation medium="tactile" mode="ink" id="int1">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>11032009</date>
    </emma:interpretation>

    http://www.w3.org/TR/emma/
    Deborah Dahl, Voice Search 2009
87. EMMA 1.0 – Biometrics Input

    Photograph:
    <emma:emma version="1.0">
      <emma:interpretation id="int1"
          emma:confidence=".75"
          emma:medium="visual"
          emma:mode="photograph"
          emma:verbal="false"
          emma:function="identification">
        <person>12345</person>
        <name>Mary Smith</name>
      </emma:interpretation>
    </emma:emma>

    Voice:
    <emma:emma version="1.0">
      <emma:interpretation id="int1"
          emma:confidence=".80"
          emma:medium="acoustic"
          emma:mode="voice"
          emma:verbal="false"
          emma:function="identification">
        <person>12345</person>
        <name>Mary Smith</name>
      </emma:interpretation>
    </emma:emma>

    http://www.w3.org/TR/emma/
    Deborah Dahl, Voice Search 2009
88. EMMA 1.0 – Representing Lattices

    Speech recognizers, handwriting recognizers, and other input-processing
    components may provide lattice output: a graph encoding a range of
    possible recognition results or interpretations.

    [Figure: a word lattice over nodes 1–8, e.g. "flights" (1→2), "to" (2→3),
    "boston" or "austin" (3→4), "from" (4→5), "portland" or "oakland" (5→6),
    then either "today" (6→7) followed by "please" (7→8), or "tomorrow"
    (6→8).]

    From Michael Johnston, AT&T Research
    http://www.w3.org/TR/emma/
89. EMMA 1.0 – Representing Lattices

    Lattices can be represented using the EMMA elements:

    <emma:lattice emma:initial="?" emma:final="?">
    <emma:arc emma:from="?" emma:to="?">

    <emma:emma version="1.0"
        xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:interpretation>
        <emma:lattice emma:initial="1" emma:final="8">
          <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
          <emma:arc emma:from="2" emma:to="3">to</emma:arc>
          <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
          <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
          <emma:arc emma:from="4" emma:to="5">from</emma:arc>
          <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
          <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
          <emma:arc emma:from="6" emma:to="7">today</emma:arc>
          <emma:arc emma:from="7" emma:to="8">please</emma:arc>
          <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
        </emma:lattice>
      </emma:interpretation>
    </emma:emma>

    From Michael Johnston, AT&T Research
    http://www.w3.org/TR/emma/
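    For comparison (not on the original slide), the best-scoring paths
    through such a lattice could instead be flattened into an emma:one-of
    N-best list; the confidences and application elements below are invented:

    <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:one-of id="nbest1">
        <emma:interpretation id="int1" emma:confidence="0.72"
            emma:tokens="flights to boston from portland today please">
          <origin>Portland</origin>
          <destination>Boston</destination>
        </emma:interpretation>
        <emma:interpretation id="int2" emma:confidence="0.21"
            emma:tokens="flights to austin from oakland tomorrow">
          <origin>Oakland</origin>
          <destination>Austin</destination>
        </emma:interpretation>
      </emma:one-of>
    </emma:emma>

    The lattice form is more compact when many paths share arcs; the one-of
    form is simpler for consumers that only need a ranked list.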
90. EMMA in the Multimodal Framework

    [Figure: the W3C Multimodal Interaction Framework, with EMMA highlighted
    as the data format flowing between its components.]

    http://www.w3.org/TR/mmi-framework/
91. InkML 1.0 – Digital Ink

    Ink Markup Language (InkML), http://www.w3.org/TR/InkML/
    A data format for representing digital ink (pen, stylus, etc.).
    Allows the input and processing of handwriting, gestures, sketches,
    music, and more.

    <ink>
      <trace>
        10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126,
        10 140, 13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147,
        49 135, 58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191,
        93 205
      </trace>
      <trace>
        130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125,
        152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200,
        150 208, 163 210, 178 208, 192 201, 205 192, 214 180
      </trace>
      <trace>
        227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134, 230 148,
        234 162, 235 176, 238 190, 241 204
      </trace>
      <trace>
        282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129, 291 143,
        294 157, 294 171, 294 185, 296 199, 300 213
      </trace>
      <trace>
        366 130, 359 143, 354 157, 349 171, 352 185, 359 197, 371 204,
        385 205, 398 202, 408 191, 413 177, 413 163, 405 150, 392 143,
        378 141, 365 150
      </trace>
    </ink>

    http://www.w3.org/TR/InkML/
92. InkML 1.0 – Status and Advances

    Rich annotation for ink:
    - Traces, trace formats, and trace collections
    - Contextual information
    - Canvases
    - Etc.

    The result of classifying InkML traces may be a semantic representation
    in EMMA 1.0 (see the sketch below).

    Current status is Last Call Working Draft; the next step is Candidate
    Recommendation, together with the release of an Implementation Report
    test suite.

    Attracting interest from major industry players.

    http://www.w3.org/TR/InkML/
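    As an illustration (not from the original slides), a hedged sketch of the
    kind of EMMA result an ink classifier might emit for traces like those on
    the previous slide; the recognized value, confidence, and the <text>
    application element are all invented:

    <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:interpretation id="ink1"
          emma:medium="tactile" emma:mode="ink"
          emma:confidence="0.9"
          emma:function="transcription">
        <!-- hypothetical handwriting-recognition result -->
        <text>hello</text>
      </emma:interpretation>
    </emma:emma>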
93. MMI Architecture Specification

    “Multimodal Architecture and Interfaces”, W3C Working Draft,
    http://www.w3.org/TR/mmi-arch/

    - The Runtime Framework provides the basic infrastructure and controls
      communication among the constituents (Delivery Context Component, Data
      Component, Interaction Manager, Modality Components).
    - The Interaction Manager (IM) coordinates the Modality Components (MCs)
      through life-cycle events and contains the shared data (context).
    - Communication between the IM and the MCs is event-based, through the
      Modality Component API — see the event sketch below.

    Ingmar Kliche, SpeechTEK 2008
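    As an illustration (not from the original slides), a sketch of what a
    life-cycle event could look like on the wire. Event and attribute names
    follow the working drafts of the time and changed in detail across
    drafts; all identifier values and the content URL here are invented:

    <mmi:mmi xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" version="1.0">
      <!-- the IM asks a voice MC to start running a dialog -->
      <mmi:StartRequest Source="im-1" Target="voice-mc-1"
                        Context="ctx-42" RequestID="req-7">
        <mmi:ContentURL href="http://example.com/dialog.vxml"/>
      </mmi:StartRequest>
    </mmi:mmi>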
94. MMI Arch – Laboratory Implementation

    Implementation of the components using W3C markup languages:
    - Interaction Manager: SCXML
    - Modality Component 1 (GUI): HTML
    - Modality Component N (VUI): VoiceXML
    (Delivery Context Component, Data Component, and the Modality Component
    API as in the Runtime Framework diagram.)

    http://www.w3.org/TR/mmi-arch/
    Ingmar Kliche, SpeechTEK 2008
95. MMI Arch – Laboratory Implementation

    - SCXML-based Interaction Manager: an SCXML interpreter with an HTTP I/O
      processor on the server (see the SCXML sketch below).
    - HTML modality component: an HTML browser on the client; Modality
      Component API: HTTP + XML (using AJAX).
    - VoiceXML modality component: a CCXML/VoiceXML server with a telephony
      interface, accessed from a phone; Modality Component API: HTTP + XML
      (EMMA).

    http://www.w3.org/TR/mmi-arch/
    Ingmar Kliche, SpeechTEK 2008
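    To make the SCXML side concrete, a minimal sketch (not from the talk) of
    an interaction-manager state that starts the voice modality component and
    waits for its result; the event names, target URI, and send type are
    invented placeholders:

    <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
           initial="start_voice">
      <state id="start_voice">
        <onentry>
          <!-- ask the voice MC to start its VoiceXML dialog over HTTP -->
          <send event="mmi.StartRequest"
                target="http://voice-mc.example.com/"
                type="http://www.w3.org/TR/scxml/#BasicHTTPEventProcessor"/>
        </onentry>
        <!-- the MC reports completion, carrying an EMMA result as payload -->
        <transition event="mmi.DoneNotification" target="done"/>
      </state>
      <final id="done"/>
    </scxml>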
96. MMI Architecture – Open Issues

    - Profiles
    - Start-up, registration, and delegation in a distributed environment
    - Transport of events
    - Extensibility of events

    http://www.w3.org/TR/mmi-arch/
97. Emotion in Wikipedia

    From the Wikipedia definition:
    “An emotion is a mental and physiological state associated with a wide
    variety of feelings, thoughts, and behaviours. It is a prime determinant
    of the sense of subjective well-being and appears to play a central role
    in many human activities. As a result of this generality, the subject has
    been explored in many, if not all, of the human sciences and art forms.
    There is much controversy concerning how emotions are defined and
    classified.”

    General goal: make interaction between humans and machines more natural
    for the humans. Machines should become able:
    • to register human emotions (and related states)
    • to convey emotions (and related states)
    • to “understand” the emotional relevance of events
