Multimodal Applications for the Mobile Generation




WHITE PAPER

Multimodal Applications for the Mobile Generation

Bogdan Blaszczak
Director of Enterprise Product Management, Intervoice, Inc.
1. Don't Call Me a Phone

The number of mobile phone users continues to grow steadily. According to the International Telecommunication Union (ITU), there were over 2 billion mobile subscribers in 2005, or approximately 30 percent of the world's population.

Beyond the sheer numbers, the other fascinating aspect is the evolution of the phone itself. It can still handle calls, of course, but even low-end phones now come with many additional features. It all started with SMS (Short Message Service), which still dominates mobile messaging despite its extreme limitations. Now phones are advertised as multimedia mobile computers and provide various connectivity, productivity, and entertainment functions.

The cellular networks have also been evolving. The 3G networks provide bitrates that are almost as fast as a home Internet connection. This makes browsing and other data-intensive functions quite usable. IDC predicts that 1.3 billion people will connect to the Internet via mobile phones by 2008.

However, the key capability that inspired this paper is simultaneous voice and data connectivity. You can enjoy it on the new 3G GSM networks, if you have a 3G phone. This is not yet the case on the 3G CDMA networks, but their high data speeds and low latency make them a perfect candidate for "Voice over IP," which will enable the same result.

2. Let's Talk

There are essentially three modes of user interaction supported by current mobile phones. Besides voice, of course, you can use the phone keypad, and you usually have some way of "pointing and clicking." Some recent announcements promise to add handwriting recognition, screen touch gestures, or even physical gestures with the phone.
A large segment of mobile users can easily create text messages through multi-key sequences on the phone keypad. A few of the new phones add a text keyboard that makes the task more straightforward. However, the keys remain tiny and tricky to use… and then there is also "BlackBerry Thumb" to worry about.

The inefficiencies of text entry on mobile devices become apparent when the text must conform to normal spelling and when it has to be complete. This is usually the case in web searches or financial transactions.

Wouldn't you rather say what you want instead of struggling with the keys?

3. Multimodal Interactions

The support for simultaneous voice and data connections makes multimodal interactions effective. The use of both modalities in an interaction is not new, of course. However, the previous level of mobile technology only let one modality remain active at a time. For example, you could not have an active data connection while talking. This limitation resulted in a "choppy" user interface and a less than satisfactory experience.

The previous attempts to improve the multimodal experience depended either on SMS messaging or on voice recognition performed on the device.

SMS delivery does not interfere with the voice channel. Hence, a device-based client can receive an SMS during a voice call and present a menu based on the message content. This solution has been successfully deployed, but SMS limitations prevent it from supporting the rich GUIs (graphical user interfaces) that modern devices can render. SMS delivery also may introduce delays long enough to make the interactions tiresome.

Device-based voice recognition has been another approach to adding the voice modality. Your voice is digitized and processed on the device, and the results are then sent to the server over an established data connection. This solution is quite feasible for simple commands or search queries. However, an additional recognition client must be installed, so the device must support the required capabilities and provide sufficient processing power. Also, the nature of this approach imposes an explicit query/response flow on the conversation that falls short of state-of-the-art IVR (interactive voice response) capabilities.

The new support for simultaneous voice and data connectivity on some mobile networks presents an opportunity to provide a robust multimodal experience to mobile users.

Truly multimodal applications should allow the user to respond through the interface most suitable for the step and the context of the interaction. Consider an application that offers voice as well as screen and keyboard interfaces. The user will usually favor the most effective interface for each step. For example, the user may make a voice request to avoid excessive typing, and then "click" on a list or map on the screen instead of listening to long descriptions of the available options. However, the context of the interaction may dramatically influence the choice of interfaces. The user may be reluctant to make voice requests in a crowded place, where typing may be an appreciated alternative. On the other hand, it is usually difficult to focus on a screen while walking. While driving, the use of a keypad may be unlawful and should generally be avoided; a voice request is the better choice.

The key to multimodal interactions is the effective synchronization of the supported modalities. Within the Intervoice Solutions Framework (ISF), the State Control element of Media Exchange provides an effective application environment for multimodal applications. Media Exchange is a server-based application execution environment based on industry standards. Besides the application engine, it also provides media management, administration, and reporting services. The corresponding development tools plug into the Eclipse framework.

The high-level architecture of the multimodal application is shown in the following diagram.

[Diagram: high-level architecture of the multimodal application]

The two main components required to run multimodal applications are the Media Server and the Media Exchange platform. The Media Server interfaces to the phone network and executes the call control and voice user interface (VUI) scripts provided by the Media Exchange. The scripts are encoded in CCXML and VoiceXML, respectively. Both scripting languages are standards of the W3C Voice Working Group.
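To make this division of labor concrete, here is a minimal, hypothetical CCXML sketch of the call-control side: it accepts an incoming call, starts a VoiceXML dialog for the voice user interface, and releases the call when the dialog ends. The structure follows the W3C CCXML specification, but the dialog URL is an invented placeholder rather than an actual Intervoice script.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- An inbound call is ringing: answer it. -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- The call is up: start the voice user interface. -->
    <transition event="connection.connected">
      <dialogstart src="'http://example.com/vui/main.vxml'"/>
    </transition>
    <!-- The VUI dialog has finished: release the call. -->
    <transition event="dialog.exit">
      <disconnect/>
    </transition>
  </eventprocessor>
</ccxml>
```

In a deployment, Media Exchange would generate documents like this dynamically; the sketch only shows the shape of the call-control flow that the Media Server executes.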
The task of the Media Exchange is to support and synchronize the operation of its three clients. The first two are the call control and voice user interface browsers. The third is a graphical client running on the mobile device. Most past implementations required this to be a custom client. However, some of the new phones provide support for dynamic web content. With these capabilities, the visual presentation can be rendered in direct response to server-side events and logic. This is the same technology that Google Maps uses to provide a responsive and efficient interface in desktop Internet browsers.

The synchronization between the modalities is implemented using State Chart XML (SCXML) running in the State Control element of the Media Exchange platform. The SCXML language is a W3C standard for defining state machines in an efficient way. In our case, the state charts represent the application's core logic, which is independent of any user interaction modes. The mode-related details are taken care of by the additional dialog elements shown in the architecture diagram. This architecture enhances the maintainability and expandability of the overall application. For example, dialogs can be added or customized independently of the state control and the other dialogs.

The visual and voice dialogs are responsible for providing effective user interfaces for their respective modalities. The visual dialog is based on the established technologies of dynamic web applications and AJAX (Asynchronous JavaScript and XML). The voice dialog can leverage the full power of VoiceXML, including the voice recognition and text-to-speech provided by the servers. This is the same technology that Intervoice has been successfully deploying for voice-only interactions. Its use in multimodal applications enables us to continue providing the expected high quality of voice user interfaces. The server-based voice recognition can effectively leverage new technologies without requiring updates to device software.

The multimodal applications built with Media Exchange also can benefit from all the other technologies that ISF provides. For example, application personalization can significantly improve the caller's effectiveness and satisfaction. This could be further enhanced by incorporating presence and location factors in the personalization rules. The potential mobility of the caller is a new dimension with its own specific challenges and opportunities.

4. Application Design

Well-designed IVR applications successfully provide critical services and round-the-clock access for customers. The improvements in voice recognition and language understanding technologies enable a fluid conversation flow instead of the extensive cascading menus of older IVRs. However, there are still few callers who find current IVR applications exciting.

The multimodality, graphical rendering, and high-speed data connectivity of the new mobile devices enable designers to add a "wow!" factor to their applications. The visual communication engages new senses and dramatically enhances the information flow. Given the rich possibilities and user experience delivered by these new technologies, we may even see more users who prefer to interact with self-service applications rather than agents. It is now quite feasible to provide mobile applications within domains that used to be constrained to the desktop. It also becomes possible to build the social networks that are so popular on the web, multiplayer online games, or virtual world interactions.

Will your next banking application look like a game? I wouldn't bet on it. However, a well-thought-out combination of voice and visual interfaces will make the caller experience more efficient and satisfying.

Needless to say, the technology to deliver simultaneous multimodal interactions does not eliminate the need for good interaction design. The key to the successful design of a multimodal application for mobile users is in the flexibility of the flow and interactions. The following list outlines several of the design considerations.

• Choice of Modality
Where possible, the user should always be given a choice of interaction modes. The design should not force the use of one modality nor require the use of all modalities. The user's choices will be driven by the information presented, the user's preferences and skills, and the environment. Environmental and situational factors may exacerbate any user interface shortcomings and prevent the user from completing the interactions.

• Best User Interface per Modality
The design for each modality should fully leverage the corresponding technical capabilities and provide the best interface possible. For example, a voice interface that simply iterates through visual elements is inferior to one based on multi-slot grammars and mixed-initiative dialogs. Instead of a directed prompt-and-response sequence, the user will be able to speak more complex sentences with multiple data points. Though the visual presentation should help the user say the right things, the screen layout should not constrain the flow of the voice conversation in any way. The designs of the modalities must support each other without compromising their specific benefits and efficiencies.
• Presentation Optimization
The content provided through each modality should be complementary. For example, the screen may present the list of items while the voice may only say the number of items. This asymmetry minimizes information redundancy while still indicating the availability of multiple modalities. The user may choose to respond by selecting an item on the screen or by asking for more details over voice.

• Presentation Synchronization
Presentations through different modalities must be closely synchronized. If the changes do not happen in a timely fashion, the user will assume an error on his part or a system failure. Timely positive feedback throughout the application is critical to the smooth flow of the conversation and to the user's perception of communicating effectively with the system. However, the application has limited control over the timing of the data channel, and the modalities may get out of sync. The application must be designed to support recovery from such situations.

• Input Synchronization
The user may respond through multiple channels. Depending on the context of the dialog, those individual inputs may be elements of a composite user input comprising multiple modalities. For example, a user may point to a graphical element and speak a voice command. The semantics of the individual inputs must be considered to discriminate between valid composite inputs and user errors.

• Adaptive Assistance
The level of assistance necessary for a given user is difficult to predict. The user's experience with the application, physical situation, and preferences are some of the deciding factors. A good design must consider all of these factors to determine the appropriate level of assistance. However, the user should be able to adjust the level at will. The proper controls should be offered in both the visual modality (as more/less options) and the voice modality (through universal commands like "explain," "keep quiet," "I am reading," etc.). This also will allow the user to focus on a single modality without distracting nagging from the other ones, especially voice.

• Situational Awareness
Awareness of the user's situation and activities can help the application adjust the interface properly. The user should be able to make an explicit declaration ("I am driving" or "I am at a meeting"). The application also can obtain clues from presence and location information, as well as noise-level information from the recognizer. Future phones are expected to detect physical movement and orientation. Then the application will know to start talking when the user keeps the phone down or when he is walking. There has been some promising research conducted in this area. The success of the Nintendo Wii game controller proves that gesture recognition is a viable interface.

• Multi-User Interactions
If the application allows multiple users to participate, the modalities they each choose may be different. The application should individualize the presentations accordingly. Furthermore, if the users can communicate among themselves, it would be highly desirable to provide on-the-fly media translations. If you consider that users may speak, text, click, or gesture, the task is quite difficult. The current technologies may not be able to handle all cases efficiently. However, in applications creating social networks of mobile users, the advantage of maintaining preferences may outweigh occasional quirks of the translation.

• Graceful Degradation
The reality of the mobile environment is that radio signals may fade and connections may be dropped. Such problems may not necessarily cause a total loss of communication but rather affect only some of the capabilities. For example, a phone may lose the 3G connection and start operating in 2.5G mode. Voice and data connections would still be available, but not simultaneously. Other failure modes may cause a loss of just the voice or just the data connection. The application should still maintain a level of functionality under limited connectivity. The user may decide to operate with reduced functionality or just wrap up what he was doing. In any case, the user will have a chance for a soft landing. The application also should try to recover the lost connections (by retrying the data connection or by placing a new voice call). Since the state of the server-side application is maintained separately from the individual modalities, a recovered modality will rejoin the conversation at the right point and with the relevant information.

As you have realized by now, while multimodal applications make interactions more efficient and enrich the experience, the internal structure of multimodal applications is quite complex. The general architecture discussed earlier enables partitioning of the application logic into cooperating subsystems. This in turn offers opportunities for code reuse and improves the maintainability of the application.

The testing of applications for mobile users also requires new approaches. You will need to take your application out of the lab and test it in real-world environments. The user's valuation of modalities changes with the physical context in which they find themselves.
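The recovery behavior outlined under Graceful Degradation maps naturally onto the SCXML notation used by the State Control element. The following hypothetical sketch (all state and event names are invented for illustration) models the voice and data connections as parallel regions, so one modality can drop and later rejoin without disturbing the other or losing the session state.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="session">
  <parallel id="session">
    <!-- Voice connection region -->
    <state id="voice" initial="voice_up">
      <state id="voice_up">
        <transition event="voice.disconnected" target="voice_down"/>
      </state>
      <state id="voice_down">
        <!-- Ask the call-control side to place a new call. -->
        <onentry>
          <send event="platform.redial"/>
        </onentry>
        <transition event="voice.connected" target="voice_up"/>
      </state>
    </state>
    <!-- Data (visual) connection region -->
    <state id="data" initial="data_up">
      <state id="data_up">
        <transition event="data.disconnected" target="data_down"/>
      </state>
      <state id="data_down">
        <transition event="data.connected" target="data_up"/>
      </state>
    </state>
  </parallel>
</scxml>
```

Because the session state lives in the chart rather than in either dialog, a modality that reconnects simply re-enters its "up" state and continues from the current point in the conversation.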
The crowd, noise, light, movement, and other environmental factors can dramatically affect the usability and performance of the application. You will need to test it while driving, walking, or sitting at a street-side café.

5. Technologies

The simultaneous multimodal functionality described in this paper depends on simultaneous voice and data connections. This capability is currently available only on the 3G GSM networks. The 3G GSM networks use UMTS (Universal Mobile Telecommunications System) technology, which in turn uses the W-CDMA (Wideband Code Division Multiple Access) radio interface. Many of the 3G GSM networks also deploy the HSDPA (High-Speed Downlink Packet Access) protocol to increase downlink speed, reduce latency, and increase capacity through better spectral efficiency. The ITU and 3GPP (3rd Generation Partnership Project) are the standards organizations defining the above specifications. However, an easy way to learn a bit more is to browse Wikipedia.

As of March 2007, there were 98 3G HSDPA GSM network operators in 52 countries. The two major GSM networks in the United States are AT&T (previously Cingular) and T-Mobile US. AT&T already has 3G/HSDPA in service. T-Mobile US is expected to start offering 3G/HSDPA later in 2007 and 2008.

The other two major U.S. mobile operators, Verizon and Sprint, use CDMA technologies instead of GSM. Verizon and Sprint provide 3G capabilities based on EV-DO (Evolution-Data Optimized) technology. They are currently deploying EV-DO Rev A, which further increases the data speed. However, the 3G EV-DO (CDMA) networks do not support simultaneous voice and data connections. On these networks, a multimodal application can only operate and deliver a "unimodal" flow of interactions.

Some phones support Wi-Fi and can maintain data connections over a wireless LAN (local area network) during voice calls. This enables simultaneous multimodal functionality around Wi-Fi hot spots. For example, the T-Mobile MDA, while not a 3G phone, will support multimodal functionality under such conditions. The future adoption of WiMAX (also called 4G) will extend the hot spot size and make the wireless LAN approach less constraining.

Interestingly, EV-DO Rev A provides the data speed and short latencies that make it quite capable of carrying packet voice (VoIP) and thus theoretically enables simultaneous voice and data. However, operators have not yet announced any plans for such solutions. An additional push may come from future IMS (IP Multimedia Subsystem) deployments. IMS is a new telecom architecture that tries to leverage IP-based protocols to create a more flexible telecom platform and to deliver "IP multimedia services" to the end user.

To take advantage of the 3G GSM capabilities, a compatible mobile device is required. As of March 2007, Cingular's (AT&T) web page listed eight models of 3G phones. Alternatively, any unlocked 3G GSM phone can be used (after the SIM card is inserted), as long as it supports the GSM and UMTS frequencies provided on the intended network. A phone with "quad-band" GSM would provide voice connections on any GSM network (with the possible exception of Japan). However, this is not a sufficient specification for 3G compatibility, because UMTS may operate on different frequencies. The phone specification should list the supported UMTS/HSDPA frequencies separately from the GSM frequencies.

Intervoice multimodal applications rely on a web browser for the visual presentation and interface. In other words, no special software or hardware is required on the mobile device for multimodal applications to function. For example, an off-the-shelf Cingular 8525 phone with the Windows Mobile 5 PocketPC operating system can be used to simultaneously access the voice and visual modalities of a multimodal application. The Cingular 8525 is a branded HTC Hermes phone with quad-band GSM and tri-band UMTS/HSDPA. The critical requirement is support for JavaScript and AJAX, which enable dynamic HTML updates during the multimodal interactions. The PocketPC Internet Explorer (PIE) browser, which comes pre-installed on Windows Mobile 5 PocketPC devices, provides the required capabilities for visual interactions. It should be noted, however, that PIE has more limitations than the desktop IE, so not all AJAX-fueled presentations will work. You can find additional explanations on the Microsoft MSDN web site.

On phones running Symbian S60 (Nokia and others), the Opera web browser provides AJAX support.

There are other technologies that could be used to provide a thin client for the mobile device. Adobe Flash is a very attractive candidate. This is the same technology that the ubiquitous Flash Player provides for desktop web browsers to enable dynamic multimedia content. Flash has not yet penetrated the mobile realm to the same extent, but Adobe seems to be working hard to improve the functionality and to support more phones. Further increases in mobile processing power and memory sizes will improve Flash performance.
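To illustrate the browser capability this requires, the following generic (X)HTML sketch polls a server for the current dialog state and updates the page in place, using XMLHttpRequest with the classic ActiveX fallback for older IE-family browsers such as PIE. The endpoint URL and element names are invented placeholders, not part of any Intervoice interface.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Visual dialog sketch</title>
    <script type="text/javascript">
      // Create the request object, falling back to ActiveX on older IE-family browsers.
      function newRequest() {
        return window.XMLHttpRequest
          ? new XMLHttpRequest()
          : new ActiveXObject("Microsoft.XMLHTTP");
      }

      // Poll the server for the current dialog state and refresh the page region.
      function refresh() {
        var req = newRequest();
        req.open("GET", "/dialog/state", true); // hypothetical endpoint
        req.onreadystatechange = function () {
          if (req.readyState == 4 && req.status == 200) {
            document.getElementById("view").innerHTML = req.responseText;
          }
        };
        req.send(null);
      }

      window.onload = function () {
        refresh();
        setInterval(refresh, 2000); // re-sync with the server every two seconds
      };
    </script>
  </head>
  <body>
    <div id="view">Loading…</div>
  </body>
</html>
```

A production visual dialog would use whatever update mechanism Media Exchange dictates; the sketch only demonstrates the dynamic-update capability the mobile browser must support.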
A thin client also can be developed in Java for mobile devices (J2ME). The J2ME application can communicate over TCP/IP sockets, so the general IP client/server architecture can be retained. To keep the client thin, the complete screen content and layouts have to be generated by the server and simply rendered by the client. This approach can produce an efficient solution that can be adapted to many phones. The negative aspect is that the client has to be installed on the phone, either by the user or by the carrier. An additional problem with Windows-based phones is that J2ME may not be pre-installed. J2ME availability on Windows Mobile devices depends on the phone vendor and model. For example, the Cingular 8525 and the T-Mobile MDA are both Windows Mobile 5 PocketPC phones built by HTC, but only the 8525 comes with J2ME.

The Media Exchange server architecture discussed in this paper is a general application server environment that can support a mix of various applications and callers. A desktop user with a softphone (a desktop application providing VoIP capabilities) can call and successfully interact with a multimodal application. A combination of a traditional phone and a desktop also will work. The call control part of the application can execute call transfers in the network, or it can connect the caller to a contact center.

Intervoice Media Exchange and Media Server are based on standards defined by the W3C Voice Working Group (www.w3.org/Voice/).

Here are the references to the standards mentioned earlier in this paper:

• SCXML: State Chart XML. SCXML is a language based on Harel State Charts (see David Harel's original Statecharts paper). SCXML provides an efficient state machine notation thanks to its support for hierarchical and parallel states. Media Exchange uses SCXML as a server-side application notation that is independent of the server environment and the client types.

• CCXML: Call Control XML. CCXML is a "third party" call control model, including call joining and conferencing. An Intervoice CCXML browser is provided on the Media Server. CCXML documents can be dynamically produced by the SCXML-encoded logic running on Media Exchange. Third-party, CCXML-compliant documents should also work.

• VoiceXML. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. The standards compliance of the Intervoice VoiceXML browser provided on the Media Server was certified by the VoiceXML Forum.

6. Conclusions

Mobility has become a fact of life. New devices and network capabilities offer an opportunity to improve the user experience and to create completely new categories of mobile applications. Multimodal applications can provide the flexibility and adaptability that mobile users require. This is new and exciting territory. The Intervoice Solutions Framework and its Media Exchange provide a robust and efficient environment for the development and deployment of multimodal applications for mobile users. At the same time, all of the traditional capabilities expected from a telecom platform are still available to build upon.

7. Appendix

Author:
Bogdan Blaszczak
Director of Enterprise Product Management, Intervoice, Inc.

Please send feedback to bogdan.

For additional information on Intervoice, please visit the Intervoice web site.

The products, features and specifications discussed in this document are subject to change without notice.

All trademarks are the property of their respective owners.

Copyright ©2007, Intervoice, Inc.
World Headquarters
Intervoice, Inc.
17811 Waterview Parkway
Dallas, TX 75252
(US) 800.700.0122
(Int) +1 972.454.8000

International Headquarters
Intervoice Limited
50 Park Road
Gatley, Cheshire SK8 4HZ, UK
+44 (0) 161 495 1000

Offices worldwide, including Santa Clara, Orlando, São Paulo, Dubai, South Africa, Singapore, Ireland, Germany, the Netherlands, and Switzerland.