Department of Linguistics and Philology
Språkteknologiprogrammet
(Language Technology Programme)
Master’s thesis in Computational Linguistics
10th June 2005
A Speech-Driven
Automatic Receptionist
Written in VoiceXML
Katarina Matzon
Supervisors:
Beáta Megyesi, Uppsala University
Tobias Öhman, Voxway AB

Abstract
This thesis describes the implementation of a speech-driven receptionist for Voxway AB. The
receptionist is designed for use by smaller Swedish companies. It answers calls coming
into the company and directs each call to an employee based on speech input from the
caller. It also handles unrecognized names and unanswered phone calls. It was programmed in
VoiceXML and ColdFusion. A database was designed and implemented to store the data needed
to make the receptionist dynamic and to log call statistics. The telephony application
was evaluated by test users and a user survey. A website (programmed in HTML and
ColdFusion) was designed to administer the telephony application and to allow companies to
customize the application as well as view statistics about their usage of it.

Contents
Abstract
Contents
List of Figures
List of Tables
Acknowledgements
1 Introduction
1.1 Purpose
1.2 Outline
2 Dialogue Systems
2.1 Speech Recognition
2.2 Dialogue Management
2.2.1 Design Methods
2.2.2 Human Communication
2.2.3 Design of Dialogue
2.3 Generator
2.4 VoiceXML
2.4.1 ColdFusion
3 Programming the Receptionist
3.1 Static Receptionist
3.1.1 Design of Dialogue
3.1.2 Basic Code
3.1.3 Building Grammars for Use
3.1.4 Integrating Error Handling in the Code
3.2 Integrating Dynamics
3.2.1 Building the Database
3.2.2 Using ColdFusion to Integrate Dynamics
3.2.3 Organizing the Code for Dynamics
3.2.4 Dynamic Queries and Output
3.2.5 Dynamic Grammars
3.2.6 Dynamic Prompts
3.2.7 Implementing Statistical Element
4 Evaluation
4.1 Evaluation Method
4.2 Testing
4.2.1 Test Users
4.2.2 Evaluation of Results
5 Designing the Web Interface
6 Concluding Remarks
6.1 Future Improvements
A Database
Bibliography

List of Figures
2.1 The three modules of a dialogue system
2.2 The relationship between SGML, HTML, XML and VoiceXML
2.3 A simple VoiceXML example
2.4 The seven subsystems of VoiceXML
3.1 Stages of Development of Receptionist
3.2 Example Dialogue
3.3 Receptionist Application’s chain of events
3.4 Example of different types of VoiceXML grammars
3.5 Example of error handling in a dialogue
3.6 Static event handling for an unanswered call
3.7 Example of a possible conversation
3.8 Query to find company name and ID
3.9 Example of ColdFusion output
3.10 Dynamic Grammar
3.11 Dynamic Prompt
4.1 Task example
5.1 Home Page
5.2 Employee List
5.3 Blank form for new employees

List of Tables
4.1 User Satisfaction Survey with Average Scores
4.2 User Satisfaction Scores

Acknowledgements
I would like to thank the people without whom this paper would not be what it is
today. Thank you to both my supervisors, Beáta Megyesi and Tobias Öhman. Thank
you Bea for your encouragement and advice in writing this thesis, and thank you
Tobias for all your encouragement and help with the programming of the
receptionist. I would like to thank Botond Pakucs at KTH for contributing advice on the
evaluation of dialogue systems. I would also like to thank my friend Jens Bergqvist
for helping me record incredible sound for the receptionist so that it sounds more
professional. Thank you to all my friends and family who have supported me this
semester and were always around to talk when I needed a break. And lastly, I
especially want to thank my boyfriend, Johan, for being such an incredible support and
help throughout this process. Thank you for being my bollplank (sounding board)!

1 Introduction
Natural language processing, a field that combines linguistics and computer science, is
growing every day. Computers that understand and interpret human language are becoming
part of everyday life. One of the branches of computational linguistics is
speech technology, where computers ‘understand’ and output speech. More and more
companies are using speech technology. If you call the Swedish railway company,
you will speak to a computer to book your tickets, and if you call the postal
office in the United States, you will speak to a computer to find the postal
code you need.
Soon enough we will not need to type on keyboards because it will be standard
to talk to our home computers. People are already speaking to their mini-computers.
For example, when a person calls a friend on a mobile phone, they can simply say the
friend’s name and the call is connected (Dobler, 2000). Another example is a car
navigation system that recites directions for the driver to follow to
the next destination (Wikipedia). These features improve our lives at home and at
work.
One branch of speech technology is spoken dialogue systems. Spoken dialogue
systems use speech technology to enable humans and computers to interact by
means of human speech. Here both aspects of speech technology, speech recognition
and speech synthesis, combine to interact with humans in the form of a dialogue.
The Merriam-Webster Online Dictionary defines dialogue as follows:
Dialogue a : a conversation between two or more persons; also : a similar exchange
between a person and something else (as a computer) b : an exchange of ideas
and opinions c : a discussion between representatives of parties to a conflict
that is aimed at resolution
A spoken dialogue system can then be defined as a system designed to carry out a
spoken conversation between a person and a computer. One area where these systems
are increasingly popular is the telephone industry. At the end of the 1990s, telephone
companies wanted to develop a common language to voice-enable the web, in other
words, to build dialogue systems that work over the web and over the telephone.
The result of this effort was VoiceXML (Voice Extensible Markup Language)
(W3C, 2003). VoiceXML made it much simpler for companies to build web-enabled
applications that include speech over the telephone, and it expanded the possibilities
for voice applications.
1.1 Purpose
The purpose of this thesis is to develop a speech-driven receptionist for Voxway AB.
Voxway AB is a company specializing in developing and hosting IVR (Interactive
Voice Response) applications with speech technology. The task involves developing
an automatic receptionist for small companies, where the goal is a comfortable
and efficient dialogue between the caller and the automated service.
The dialogue system is programmed in VoiceXML. The receptionist is
designed to expect the name or position of a person at the company. If the
requested person can be reached at several numbers, the application asks which number it
should connect to (mobile, home, work). Once the system knows the correct
number, it connects the call. It also handles problems such as unrecognized names, busy
signals, and unavailability.
Besides the dialogue aspect, the work involves designing a database and a
web interface that each company can access in order to customize the
application to its needs. Each company has its own application content, which is stored
in a database and accessed by using the telephone number that receives the call as a
key. The information in the database is managed through the website, which
allows different companies to log in with a password and enter the information
for each employee that the ‘receptionist’ needs in order to connect a call.
The website also allows companies to see statistics about the calls coming into
the company and the calls transferred within the company.
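
The idea of using the dialed telephone number as a key can be illustrated with a small sketch. It is written in Python with an in-memory SQLite database purely for illustration; the thesis itself uses ColdFusion, and the table names, column names, and sample data below are hypothetical, not the schema of the actual application.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE company (
            id INTEGER PRIMARY KEY,
            name TEXT,
            dialed_number TEXT UNIQUE
        );
        CREATE TABLE employee (
            company_id INTEGER,
            name TEXT,
            work TEXT,
            mobile TEXT
        );
    """)
    # Invented sample data: one company with two employees.
    conn.execute("INSERT INTO company VALUES (1, 'Example AB', '+4618000000')")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?, ?)",
        [
            (1, "Anna Berg", "+4618000001", "+4670000001"),
            (1, "Erik Lund", "+4618000002", None),
        ],
    )
    return conn


def employees_for_call(conn, dialed_number):
    """Return the employee names of the company that owns the dialed number."""
    rows = conn.execute(
        "SELECT e.name FROM employee e "
        "JOIN company c ON c.id = e.company_id "
        "WHERE c.dialed_number = ? ORDER BY e.name",
        (dialed_number,),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling employees_for_call(build_demo_db(), "+4618000000") returns the names stored for that company, while an unknown number returns an empty list, so one application can serve many companies from the same code.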
1.2 Outline
This paper describes the implementation of an automatic receptionist. Chapter two
gives a background on dialogue systems in order to prepare the reader for chapter three,
which discusses the implementation of the receptionist, from the static receptionist to
the dynamic receptionist. Chapter four describes the evaluation of the implemented
receptionist. Chapter five describes the design and implementation of the website.
The paper ends with concluding remarks and suggestions for
future improvements.

2 Dialogue Systems
Spoken dialogue systems are systems built to handle human-computer interaction in
the form of speech. A system normally consists of different modules that handle
different aspects of the dialogue. A simple system consists of three modules: a speech
recognizer, a dialogue manager and an output generator, as seen in Figure 2.1
(Gustafson, 2002).
Figure 2.1: The three modules of a dialogue system (speech recognizer, dialogue manager, output generator)
The first part is the automatic speech recognizer, which converts the input speech
into text that the computer can parse. Once the text is parsed, it is sent
to the dialogue manager, which decides how the system should react to the input.
Often, the reaction is to send output to the output component, or generator. The output
component consists of recorded prompts or text-to-speech (TTS), which converts a
given output into speech to be played to the user.
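
The flow through the three modules can be sketched in a few lines of Python. This is only an illustration of the division of responsibilities: the function names and the tiny name directory are hypothetical, and the "recognizer" here receives text instead of audio.

```python
def recognize(utterance):
    """Stand-in for the speech recognizer: the 'speech' is already text here."""
    return utterance.lower().strip()


def manage(text, directory):
    """Dialogue manager: decide how the system should react to the input."""
    if text in directory:
        return f"Connecting you to {directory[text]}."
    return "Sorry, I did not understand. Who would you like to speak to?"


def generate(response):
    """Output generator: a real system would play a prompt or call TTS."""
    return response


def run_turn(utterance, directory):
    """Pass one user utterance through recognizer, manager and generator."""
    return generate(manage(recognize(utterance), directory))
```

For example, run_turn(" Anna ", {"anna": "Anna Berg"}) yields a connection message, while an unknown name yields a re-prompt.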
Together these components form a dialogue system. The system can accept
input as speech, parse this input, decide how to handle it, and send output
via the generator. This is how a general dialogue system works, but systems are
designed with different goals in mind, and each component in the system will be
shaped differently depending on the goal. For example, the CU Communicator is
an interactive dialogue system for travel information over the phone (Pellom and
Ward, 2000). In comparison, a system with an entirely different goal is August, a
multimodal dialogue system which was used to interact with people at the cultural
center in Stockholm (Gustafson et al., 1999). Since dialogue systems can differ so
greatly, they are divided into three categories.
The first is the task-oriented dialogue. This type of dialogue has well-defined goals and
is usually simple. Examples include simple question-and-answer
systems such as the CU Communicator mentioned above. Another example is a system
that gives train times over the telephone, such as the Philips automatic train timetable
information system (Aust et al., 1995). The second type is the explorative
dialogue, where the goals are less well-defined; instead, the goal is to acquire
knowledge about complex tasks or to browse information (Gustafson, 2002). An
example would be an information browsing system such as AdApt, which allows users
to find information about available apartments in the Stockholm area (Gustafson
et al., 2000). Although there is a goal in the interaction, it is not easily defined. With
AdApt, the goal may be to find an apartment to buy or simply to browse available
apartments out of curiosity. The third type of dialogue is context-oriented. These
dialogues are focused on the actual dialogue situation. The primary goal for the user
in this interaction is to be entertained (Gustafson, 2002). The dialogue is based on
the system, its location, or its surroundings. An example would be a
museum guide system that talks about the exhibition it is stationed in, such as August,
the system described earlier (Gustafson et al., 1999). August has no goal other than
conversing.
Today, task-oriented dialogue systems are the most common, mostly because
it is easy to measure the errors and effectiveness of such systems when the goals are so
clear (Gustafson, 2002). The other two types are possible, however, and would greatly
expand the possibilities of dialogue systems.
Each of the components of a dialogue system is examined in more depth
below.
2.1 Speech Recognition
Automatic speech recognition (ASR) is the task of converting speech to text that
can then be parsed by the computer. Determining what type of recognizer to build
is one of the first steps. Many types of recognizers exist. One distinction is based
on whether the system has prior knowledge about the user’s speech characteristics or
not. Speaker-dependent (SD) systems are designed to understand speakers who have
previously trained the system, and speaker-independent (SI) systems are trained to respond
to a large group of people where training for each individual would be impossible
(O’Shaughnessy, 2000). SD systems exist, for example, in mobile phones, where the
speech recognizer recognizes only its owner’s way of pronouncing the names in the phone
book. SI systems are much harder to make successful considering the
large variations in speech that need to be taken into consideration.
Inter-speaker variability is the difference in speech between individuals. These
differences include dialect, emotion in speech, sex of the speaker, and age of the
speaker. For example, the accent of a person from the south of Sweden is very
different from the accent of a person from the north of Sweden. An SI
recognizer needs to account for these differences in order to understand a broader range of
people. The emotion in a voice also differs between speakers.
For example, the level of excitement in a voice varies depending
on the speaker. All of these differences and more need to be considered when building
an SI system.
Besides inter-speaker variability, intra-speaker variability exists. Intra-speaker
variability is the variability of speech within one person. One person is unlikely to
utter exactly the same thing more than once. The combination of intonation, pauses and
emphasis is difficult to repeat exactly. This affects both SI and SD systems. A speech
recognizer needs to be broad enough to handle these subtle differences in speech and
be able to recognize the words that are spoken, but it needs to be narrow enough so
that it does not confuse similar words.

Besides the aspects of speech, the nonspeech aspects are important to consider
as well. Background noise is a major factor in recognition. It is more difficult for
the recognizer to recognize a person sitting in a crowded restaurant than a person in
an empty room because of all the noise in the
background. Channel distortion also needs to be considered. If a person is interacting
with a system via a telephone, the connection can worsen the recognition because of
bandwidth limitations in the telephone network. Mobile phone connections can be
bad, and if a person calls from overseas, the connection can be affected and make it
more difficult for the recognizer to understand the caller. The perfect conditions for
a speech recognizer are one person in a silent room interacting with the computer
without a medium such as a telephone. These conditions are, of course, not that
common.
Once speech is recognized and the actual text is extracted, the computer can parse
the input in one of a few ways. Each speech recognizer is equipped with a linguistic
component that parses the text before it is sent to the dialogue manager. The
simplest parser is a static grammar, which means that the parser has an unchanging
grammar against which the input is matched to find the best match. Several entries
can be similar to one another, and the system can therefore produce a list ranked from
the most similar match to the least similar. In more complex recognizers, a lexicon or
corpus with a much larger number of words interacts with a grammar to parse
the meaning of the input (Gustafson, 2002). This allows for more possibilities when it
is impossible to know exactly what inputs will be entered. A more complex linguistic
component allows for a more robust system.
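
Matching input against a static grammar and ranking the candidates can be sketched as follows. This Python example is a simplification under stated assumptions: it compares text strings with the standard library's difflib, whereas a real recognizer ranks hypotheses using acoustic and language-model scores. The grammar entries and the function name are hypothetical.

```python
from difflib import SequenceMatcher


def n_best(recognized, grammar, n=3):
    """Rank grammar entries from most to least similar to the recognized text."""
    scored = [
        (SequenceMatcher(None, recognized.lower(), phrase.lower()).ratio(), phrase)
        for phrase in grammar
    ]
    # Highest similarity first; keep only the n best candidates.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [phrase for _, phrase in scored[:n]]
```

With the grammar ["anna berg", "erik lund", "sales"], the slightly misrecognized input "ana berg" still ranks "anna berg" first, which is the behaviour a static grammar relies on.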
Once speech is recognized and parsed so that the system can interpret it, it is sent
to the next component, the dialogue manager.
2.2 Dialogue Management
The dialogue manager in a dialogue system is the backbone of the system. Once a text
is parsed by the recognizer, the dialogue manager has to decide what to do with the
input it has received. There are several different aspects to consider in the design of
the dialogue manager so that it can handle input correctly and a successful dialogue
can be programmed. The first and most basic is which method of design the designer
chooses.
2.2.1 Design Methods
A few different ways to design a dialogue system exist: design by inspiration, design
by observation and design by simulation (Gustafson, 2002). Designing by inspiration
is when a designer decides how to design the dialogue without consulting
any external party. This is a bit risky since one person cannot think of all the
possibilities in a conversation, and it relies solely on the linguistic competence of the designer
(Gustafson, 2002). It can be considered an option in simple systems where the
purpose is for the user to reach a goal. Here it works since the user can be trained in how
to reach the goal, and then the dialogue system can be considered a success. In
more complex systems, it will most likely not give a good result. Designing by
observation is when the designer observes communication between humans in
the situation the system is meant to handle and tries to incorporate aspects of that
communication into the system. Lastly, design by simulation (the Wizard-of-Oz
technique) is when some or all parts of a system are simulated so that different
aspects of the dialogue can be tested (Gustafson, 2002). This is quite a useful strategy
because it makes the test situation more realistic: a human speaks to a
simulated interface instead of to another human. The type of system and
the possibilities the designer has will decide which design strategy is best suited for
the dialogue system.
Once a design method is chosen it is important to consider certain principles that
exist in human communication.
2.2.2 Human Communication
In order for a successful dialogue to be designed, the designer needs to observe
human dialogue and account for the unwritten rules that exist in human conversation.
Only by following these rules and principles will the designer be able to design a
dialogue system that people find as natural as speaking to a human. These principles and
rules are discussed below.
Certain assumptions exist when humans communicate in order for a conversation
to be satisfactory to all parties. Principles have been studied and defined so that
communication can be more easily studied. Grice (1975) formulated four
well-known maxims that govern conversation; when they are not followed, a
conversation can be considered unsatisfactory. These four maxims are listed below.
• Quality. This means that in a conversation a person should always be sincere.
People expect to hear the truth and will therefore be surprised if this maxim is
not followed.
• Quantity. This means a person should say neither too little nor too much. If a
person doesn’t say enough then it could lead to confusion and the same could
happen if they say too much.
• Relevance. This is easily explained as what a person says should always be
relevant in conversation. If a person starts speaking of something unrelated to
the current subject then it will confuse the listeners.
• Manner. This means avoid ambiguity. Be clear and to the point otherwise it
can lead to confusion.
All of these maxims need to be upheld in a dialogue system if the user is to feel
comfortable with the conversation.
Besides the underlying principles of conversation, the conversation structure is
important to follow. Conversations between humans are structured in turn-constructional
units (TCUs). Each speech act by each partner is considered a TCU, and these TCUs
are separated by transition-relevance places (TRPs) (Norrby, 1996). For example, if one
person directs a question to another person, that is considered a TCU. The answer the
other person gives is another TCU, and the time between the question and the answer
is a TRP. TRPs are extremely important because they signal when another party can
take a turn. TRPs are the natural place to take a turn if you are participating in a
conversation. They can be signalled by a longer pause, the intonation at the end of a TCU
and other signals that humans perceive automatically. It is important for the dialogue
system to determine whether a pause is a TRP or not; otherwise a conversation can be
frustrating for the user.
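
How a system might decide that a pause is a TRP can be illustrated with a deliberately simple rule. The Python sketch below treats any silence at or above a fixed threshold as a possible place to take the turn. The 700 ms threshold and the function name are assumptions made for illustration only; as noted above, humans also rely on intonation and other signals, which this sketch ignores.

```python
def find_turn_ends(pauses_ms, threshold_ms=700):
    """Return the indexes of pauses long enough to be treated as a TRP.

    pauses_ms holds the silence (in milliseconds) observed after each
    stretch of speech; threshold_ms is a hypothetical cut-off value.
    """
    return [i for i, pause in enumerate(pauses_ms) if pause >= threshold_ms]
```

A threshold that is too low makes the system interrupt the caller, while one that is too high makes it appear slow to respond, which is why this value usually has to be tuned for the application.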
These TRPs can be easier to find if the role of initiative in the dialogue is clear.
When one person starts a dialogue, she has the initiative. The initiative can switch between
the different parties as the conversation moves along to keep it going forward. A
conversation is considered single initiative if one party always takes the initiative
(Gustafson, 2002). For example, the Danish flight ticket reservation system is a mainly
system-directed, task-oriented dialogue (Bernsen et al., 1997). Mixed initiative is when
either party can take the initiative (Gustafson, 2002). This can be seen in a system where
the user can prompt the system for an answer to a question and the system can do
the same with the user. An example of such a system is the Waxholm system, which
gives boat information for the Stockholm archipelago and was designed to allow user
initiative as well as system initiative (Carlson et al., 1995).
These assumptions and underlying rules of conversation need to be taken into
consideration when designing a dialogue manager. Otherwise the dialogue will most likely be
unsatisfying for the human user. The next step is programming the actual dialogue.
2.2.3 Design of Dialogue
Once the design method is decided and conversation principles are considered, the
designer is ready to program the type of dialogue the manager will understand and
interpret.
To help in the design process, the designer can gather examples of dialogues to
base the design on or, if this is not possible, the designer can use scenarios (Gustafson,
2002). With scenarios, the designer considers all the different types of dialogues
that can occur with the system in order to form a successful design. Scenarios are very
helpful in that they take the system through as many different dialogues as possible.
With the help of the gathered examples or scenarios, a dialogue is designed. The
dialogue manager can then be programmed to interact with human users in the limited
way that the system was designed to. In order for the system to reach a greater
scope of information, the dialogue manager may interact with a database. A database
stores all the information that could be relevant to the dialogue. For example, in a
train booking system, where people call to book tickets, the dialogue manager must
interact with the database in order to find information about the relevant
trains. The contents of the database also help determine what output is acceptable. Once
the dialogue manager has processed the input, the appropriate output is sent to the
next component, the output generator.
2.3 Generator
Output can be generated in a few ways in a dialogue system. One way is through
recorded prompts that are played back to the user. Another way is to generate speech
with a TTS system.
Recorded prompts can be used for messages that are always played
in every dialogue. They are chosen because a real voice can be considered more
pleasing to human listeners than a computer-generated
voice.

TTS is used when the output cannot be foreseen. TTS does not sound as natural
as a human voice, and therefore recorded prompts are sometimes preferred, but in
many systems output is often unique, which makes TTS extremely powerful. TTS
systems generally synthesize speech from text using linguistic processing and by
concatenating small speech units. They convert input text into speech waveforms using
algorithms and previously coded speech data (O’Shaughnessy, 2000). Speech
synthesizers can be characterized by the size of the speech units they concatenate and by the
method used to synthesize the speech (O’Shaughnessy, 2000). Large speech units
produce high-quality speech but require a lot of memory, while efficient coding
reduces memory but also reduces speech quality. Most commercial synthesizers have
been based on word or phone concatenation (O’Shaughnessy, 2000).
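
The division of labour between recorded prompts and TTS can be sketched as a simple selection rule: use a recording when one exists for the message, and fall back to TTS for output that cannot be foreseen. The Python sketch below is illustrative only; the prompt identifiers, file names, and function name are hypothetical, and a real generator would play the audio or call a synthesizer instead of returning a tuple.

```python
# Hypothetical table of fixed messages that have been recorded in advance.
RECORDED_PROMPTS = {
    "welcome": "welcome.wav",
    "goodbye": "goodbye.wav",
}


def choose_output(message_id=None, text=None):
    """Pick a recorded prompt when available, otherwise fall back to TTS."""
    if message_id in RECORDED_PROMPTS:
        return ("audio", RECORDED_PROMPTS[message_id])
    if text is not None:
        return ("tts", text)
    raise ValueError("no recorded prompt and no text to synthesize")
```

For example, choose_output("welcome") selects the recording, while choose_output(text="Connecting you to Anna Berg") is routed to TTS because the employee name is not known in advance.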
Two commercial applications exist for speech synthesizers: voice-response
systems, which handle input text of limited vocabulary and syntax, and TTS systems,
which accept all input text (O’Shaughnessy, 2000). TTS systems construct speech
from text using small speech units and much linguistic processing, whereas
voice-response systems simply concatenate speech from the large units the system has
stored. TTS systems are the systems that are of interest for most spoken dialogue
systems.
Several different methods of synthesis exist for TTS systems, including
formant synthesis, articulatory synthesis, linear predictive coding synthesis, and
waveform synthesis. The highest-quality synthesized speech uses waveform coders and
large memories (O’Shaughnessy, 2000). Such synthesizers can be more
advanced than some systems require. Synthesizers can also be classed as terminal-analog
or articulatory synthesizers (O’Shaughnessy, 2000). With
articulatory synthesis, the sound is created by modelling the actual vocal tract shapes and
movements. In terminal-analog synthesis, only the acoustic results of speech are
modelled, without taking the vocal tract into account. The choice of synthesizer is
greatly influenced by the size of the vocabulary. For example, a synthesizer that must
handle unlimited text will generally produce lower-quality speech than
one that has limited output.
The generator is the last of the three components that a dialogue system
consists of. Now I will discuss one way to implement a dialogue system. This
is the implementation that will be used in this thesis. To learn more about
speech synthesis or speech recognition, refer to O’Shaughnessy (2000). For more
information on dialogue systems, refer to Gustafson (2002).
2.4 VoiceXML
VoiceXML (Voice Extensible Markup Language) is a powerful markup language that
descends from SGML (Standard Generalized Markup Language). VoiceXML has
two older siblings, HTML and XML, which were developed as children of SGML
(see Figure 2.2). Whereas HTML is considered a single SGML application, XML is
a metalanguage, just as SGML is. A metalanguage is a language that is used to define
other languages (Abbott, 2002). All the descendants of SGML are markup languages,
which means that information content is stored with tags that describe the meaning
of the information content (Abbott, 2002). XML was designed to
generalize the success of HTML and also to allow for a broader user base than SGML
by taking away some of the complexities of its parent language (Abbott, 2002).
VoiceXML can be considered a younger sibling of HTML.
Figure 2.2: The relationship between SGML, HTML, XML and VoiceXML
Although it is a sibling, it interacts differently with its users than HTML does: in
VoiceXML applications the user speaks to the computer, whereas in HTML the user
communicates visually with the computer using a mouse or keyboard (Abbott,
2002). VoiceXML was developed after discussion between telephone companies to
develop a common language to voice-enable the web. The first version was released
in August 1999. A simple example is seen in Figure 2.3. Running
this example produces a TTS rendering of the text ‘Hello World!’.
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <!-- The text inside the block is spoken to the caller via TTS. -->
    <block>Hello World!</block>
  </form>
</vxml>
Figure 2.3: A simple VoiceXML example
VoiceXML can be seen as a complete dialogue system for telephony applications
where the designer simply has to program the dialogue manager and build grammars
for the system. This can be seen in the seven subsystems which are listed below and
illustrated in Figure 2.4.
Network Interface Communicates with a web server over HTTP.
VoiceXML Interpreter Software that can be considered the dialogue manager. This
is where the programming and construction of the dialogue takes place.
TTS As discussed above, translates text to speech.
Audio Allows audio prompts to be played or recorded.
Speech Recognition As discussed above, translates user utterances into text.
VoiceXML uses speaker-independent speech recognition, where the interactions are
structured dialogues in which the user is limited to a finite vocabulary.
DTMF (dual tone multi-frequency) Translates keypad input into characters.
Telephony Interface Enables communication with telephone networks.
Figure 2.4: The seven subsystems of VoiceXML (Abbott, 2002)
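
The DTMF subsystem's job of translating keypad tones into characters can be sketched as a table lookup. Each key on a telephone keypad is signalled by one low-group and one high-group frequency. The Python table below uses the standard DTMF frequency pairs; the function name, and the assumption that tones have already been detected as exact frequency pairs, are simplifications for illustration.

```python
# Standard DTMF frequency pairs in Hz: (low group, high group) -> key.
DTMF_KEYS = {
    (697, 1209): "1", (697, 1336): "2", (697, 1477): "3", (697, 1633): "A",
    (770, 1209): "4", (770, 1336): "5", (770, 1477): "6", (770, 1633): "B",
    (852, 1209): "7", (852, 1336): "8", (852, 1477): "9", (852, 1633): "C",
    (941, 1209): "*", (941, 1336): "0", (941, 1477): "#", (941, 1633): "D",
}


def decode_dtmf(tone_pairs):
    """Translate a sequence of detected tone pairs into keypad characters."""
    return "".join(DTMF_KEYS[pair] for pair in tone_pairs)
```

For instance, the pairs (697, 1209) and (941, 1336) decode to the string "10". DTMF input gives callers a fallback when speech recognition fails, for example in a noisy environment.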
By putting together speech recognition, speech synthesis, XML and the web in
one powerful language, VoiceXML is able to extend the reach of the web, since
it allows the web to be accessed from anywhere. It makes the web easier to use, especially
for people who are blind or illiterate. In addition, it increases the
options for human-computer interfaces since it is an inexpensive option compared
to other voice applications (Abbott, 2002). VoiceXML has taken expensive,
high-end speech technology and combined it with a markup language to make
speech technology available even for low-end systems.
VoiceXML works by interpreting between the user and the web server. The
VoiceXML code lies on a server and is accessed via the web or via a telephone number. The
code is processed and is able to form a dialogue with the caller. Although this is
powerful in and of itself, it is not very exciting. It can be compared to a static web page:
the results never change. In order to make it dynamic, it can be integrated with a web
application server, which allows it to connect to a database. One such application server is
ColdFusion.
2.4.1 ColdFusion
ColdFusion was created in 1995 to bring dynamic content to the internet (Danesh
and Motlagh, 2000). ColdFusion interprets commands given by the web and connects
to the database to retrieve the necessary information. For example, a website that
contains many articles uses an application server such as ColdFusion to access the
articles in the database. Otherwise each article would have to have its own web
page. This is what makes the web dynamic. When ColdFusion is integrated with
VoiceXML, it allows telephony applications to become dynamic. ColdFusion is
responsible for getting information to and from the database in the same way it
does with regular web pages, but in voice applications the pages it generates are
interpreted by the VoiceXML gateway. ColdFusion code can be embedded directly in
VoiceXML applications, which makes it simple and easy to learn. Simple SQL
statements are used to retrieve the necessary information from the database, and
this information is then processed further by the VoiceXML code.
3 Programming the Receptionist
The receptionist is programmed using VoiceXML and ColdFusion. Since the other
parts of a dialogue system are included in the VoiceXML system (see section 2.4), the
focus of the implementation is on the design and implementation of the program
code. The receptionist is developed in several stages (as seen in Figure 3.1).
The first stage involves designing a static receptionist with no dynamic
information, to make sure that the program can run with hard-coded information.
The next step involves integrating event handlers that handle misrecognitions
and other events. Once these two pieces are working, a database is developed that
allows the information that the receptionist uses to be dynamic. After the
database is done, the static receptionist is reprogrammed to include ColdFusion
markup language (CFML), which enables communication with the database. Once the
dynamics are in place, I am able to add statistical elements that are important
for administrative purposes, such as call length, the time the call started, the
phone number that the user called from, and the number the user called. After
this, a website is designed that allows companies to submit, change, or delete
information in the database. Each of these developments is discussed below.
[Diagram of the stages, in order: Static Code, Event Handlers, Database Design,
Dynamic Code, Statistics]
Figure 3.1: Stages of Development of Receptionist
3.1 Static Receptionist
3.1.1 Design of Dialogue
Before programming the receptionist, the dialogue is designed. Since it is a simple
dialogue, it is designed by inspiration and some observation of receptionist situations.
The dialogue needs to uphold Grice's four maxims as discussed above, make the
turn relevance places (TRPs) obvious to the caller, and keep the system's
utterances simple so that the user will model their own utterances on the
system's. The best approach is to be direct and to the point in as few words as
possible. The dialogue is designed to be single-initiative, so the system always
directs the caller. More experienced users can, however, barge in, interrupting
the computer while it is speaking, which makes the dialogue more efficient. An
example dialogue can be seen in Figure 3.2.
(1) Computer: Välkommen till företaget. Vem vill du prata med?
Caller: Anna Matzon.
Computer: Vill du prata med kundservice Anna Matzon?
Caller: Ja.
Computer: Vill du bli kopplad till jobbtelefon, mobilen eller hemtelefon?
Caller: Jobbtelefon.
Computer: Varsågod. Snälla vänta medans jag kopplar samtalet.
(samtalet kopplas)
(2) Translated into English
Computer: Welcome to the Company!
Who would you like to speak to?
Caller: Anna Matzon.
Computer: Would you like to speak to customer service Anna Matzon?
Caller: Yes.
Computer: Would you like to be connected to work, mobile, or homephone?
Caller: Workphone.
Computer: One moment. Please wait while I transfer your call.
(call transfers)
Figure 3.2: Example Dialogue
In this conversation, quality is upheld since there is no false statement in the
conversation and the system is therefore sincere. Quantity is also upheld since the
questions are simple but informative, so that the user knows what response is
necessary. The conversation upholds the relevance maxim since all the questions
asked by the system are related to the goal of connecting the caller to a callee.
Since the questions are unambiguous, the manner maxim is also upheld. In this way,
all four maxims are satisfied. Since the system mostly asks questions, the TRPs are
also clear to the user, since an obvious TRP is the end of a question. The user is
placed in a single-initiative situation since the questions are always directed to
the user, and the user should not feel a need to ask questions in return.
The goal of the receptionist is not to have a long conversation, but to connect
the caller to a callee as simply and quickly as possible. This dialogue succeeds in
that respect while upholding the rules of human conversation. The implementation of
this design is discussed below.
3.1.2 Basic Code
The static receptionist, where all values are hard-coded, is programmed solely with
VoiceXML. In the static version, the program code consists of one document that is
followed linearly to connect the caller to a fixed destination. This chain of events can
be seen in Figure 3.3.
[Diagram of the chain: Caller, Callee Name, Confirm Callee, Callee Number,
Transfer Call, Callee]
Figure 3.3: Receptionist application's chain of events
In the first part of the code, speech synthesis is used to ask who the caller would
like to speak to. The response the caller gives has to be a part of the active grammar
in order for it to be accepted. The grammars are discussed further below.
If the user gives a response recognized by the system, the system asks the caller
to confirm the recognized person. If the recognized person is incorrect, the code
starts from the beginning. If the person is confirmed, a speech synthesis prompt
asks the caller which telephone number she would like to be connected to. This
response is also constrained by a grammar. In the static version, the computer asks
about home, work and mobile phone for every person, since no database exists with
information about which numbers each employee has. Once the number is retrieved,
the code moves on to the transfer section, where the call is transferred to the
phone number that the caller chose. If the number is busy, the caller is told that
they have to call back; a similar message is given if no one answers. After the
transferred call has ended and control has returned, the system plays a simple
final message before the call disconnects.
In the static code, however, the telephone number is always the same since it is
hard-coded. The static code is therefore of little practical use except as a base
to build on. How this static code turns into useful dynamic code is discussed later
in this chapter, but first grammars and event handlers are discussed.
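The chain of events described above can be sketched as a single VoiceXML form. This is a minimal illustration and not the thesis's actual code: the grammar file names (names.grxml, number.grxml), the placeholder telephone number, and the English prompt wording are assumptions, and the built-in boolean field type used for the confirmation is optional in VoiceXML 2.0, so its availability (especially for Swedish) depends on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="receptionist">
    <!-- Step 1: ask for the callee; names.grxml is a hypothetical grammar file -->
    <field name="person">
      <prompt>Who would you like to speak to?</prompt>
      <grammar src="names.grxml" type="application/srgs+xml"/>
    </field>

    <!-- Step 2: confirm the recognized name (assumes the platform offers the
         optional built-in boolean grammar) -->
    <field name="confirmed" type="boolean">
      <prompt>Would you like to speak to <value expr="person"/>?</prompt>
      <filled>
        <if cond="!confirmed">
          <!-- Wrong person: clear both fields so the form starts over -->
          <clear namelist="person confirmed"/>
        </if>
      </filled>
    </field>

    <!-- Step 3: ask which phone to ring; number.grxml is hypothetical -->
    <field name="phone">
      <prompt>Would you like to be connected to work, mobile, or home phone?</prompt>
      <grammar src="number.grxml" type="application/srgs+xml"/>
    </field>

    <!-- Step 4: bridge the call to a hard-coded placeholder number -->
    <transfer name="call" dest="tel:+46000000000" bridge="true">
      <prompt>One moment. Please wait while I transfer your call.</prompt>
    </transfer>

    <!-- Step 5: final message once the transferred call has returned -->
    <block>
      <prompt>Thank you for your call.</prompt>
      <disconnect/>
    </block>
  </form>
</vxml>
```

Because the form items are visited in document order, clearing the person and confirmed fields is enough to send the caller back to the first question.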
3.1.3 Building Grammars for Use
In building the grammars for the receptionist, the goal is to keep the accepted
responses short and simple, so that the dialogue is efficient and, at the same
time, the speech recognizer can work easily with short phrases. As discussed
earlier, VoiceXML is built up of seven subsystems. One of these subsystems is the
speech recognizer. In order for the recognizer to recognize user input, it needs to
be told what the accepted responses are so that it can try to match them with the
user input. This is done with grammars. A grammar can be built in several ways in
VoiceXML. It can be a simple list of options, an inline grammar that is placed where
it is used, or an external grammar that is placed in another document. Examples of
these three are found in Figure 3.4. For the static code, an external grammar is used
for both grammars. The first grammar contains all the acceptable names a user can
ask for (name grammar) and the second contains the different types of telephone
numbers they could be connected to (number grammar).

<option value="röd">röd</option>
<option value="blå">blå</option>
<option value="grön">grön</option>

<rule id="number" scope="public">
<one-of>
<item>jobbet</item> 
<item>mobilen</item> 
<item>hemma</item> 
</one-of>
</rule>
Figure 3.4: Example of different types of VoiceXML grammars
As seen in Figure 3.4, the external grammar is identical to the inline grammar,
the only difference being that an external grammar is placed in another document
instead of in the code. Both are composed of rules that are defined by listing the
possibilities. The options grammar is a bit different since there are no rules;
instead, a field has a set of options that defines the grammar. An external grammar
is chosen for both grammars in the static code since it is neater and does not
clutter the code. Since it is an external grammar, the rules can be more expansive
as well.
Since these grammars are what the speech recognizer will try to match to the user
input, the text is written as say-as text which is similar to orthographic transcription.
For example, Matzon is written matson since the z is pronounced as an s when spoken.
Although it is written as it sounds, it is not phonetically transcribed.
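As an illustration of the points above, an external name grammar could look like the following sketch. The file layout follows the W3C SRGS XML form; the rule name, the language tag and the entries are assumptions chosen to match the examples in this chapter (note the spelling matson rather than Matzon).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical external grammar file, e.g. names.grxml -->
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="sv-SE" mode="voice" root="fullname">
  <rule id="fullname" scope="public">
    <one-of>
      <!-- Names are written the way they are pronounced -->
      <item>anna matson</item>
      <item>anna</item>
      <!-- A position can be accepted as well as a name -->
      <item>kundservice</item>
    </one-of>
  </rule>
</grammar>
```

A field would then typically reference the file with an element such as `<grammar src="names.grxml" type="application/srgs+xml"/>` instead of listing the rules inline.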
Once the grammars are implemented, the system recognizes an accepted name
and connects the caller to the static phone number. But what happens with input that
is not included in the grammar? Event handling is discussed in the next section.
3.1.4 Integrating Error Handling in the Code
Error handling is necessary in order to handle exceptions in a way that is pleasing
to the user. Errors introduced by imperfect recognition are a large problem facing
dialogue systems (Choularton, 2004). Two general approaches exist to tackle this
problem: error avoidance and error handling (Choularton, 2004). VoiceXML has
built-in error handling for certain exceptions such as nomatch and noinput. A
nomatch occurs when a person's response does not match any item in the specified
grammars, whereas a noinput occurs when the user gives no audible response. By
default, VoiceXML handles both with a simple TTS error message and then reprompts
the user for a response. This is potentially frustrating for a user, since they
hear the same error message every time they give an unacceptable response. It is
important that the exceptions are handled differently depending on the number of
times the user has given an unacceptable response. Since the system should sound
natural, repeating the same question again and again is not desirable.
According to Shin et al. (2002), users who are met with an error tend to rephrase
or repeat their response. This user behavior can be modelled in dialogue systems to
manage the dialogue when errors are introduced (Choularton, 2004). Accordingly, the
user is first prompted to repeat their answer, and the second time they are given
more specific instructions for rephrasing their response. This follows the most
common way of handling errors, even if it is not the most desirable, since the
information from the user's first response is discarded (Gorrell, 2003). For
example, the message the user hears after a first unrecognized response differs
from the message after a third. An example conversation with error handling is
seen in Figure 3.5.
(3) Computer: Välkommen till företaget. Vem vill du prata med?
Caller: ehm, jag vet inte.
Computer: Jag är ledsen. Jag förstod inte. Vem vill du prata med?
Caller: ehm, jag vet inte.
Computer: Jag känner inte igen det namnet. Du kan säga namnet eller
funktionen av personen du vill prata med.
Caller: Jag kommer inte ihåg.
Computer: Tyvärr så förstod jag inte. Jag kopplar dig till kundtjänst.
(4) Translated to English
Computer: Welcome to the company. Who do you want to speak to?
Caller: ummm, I don’t know
Computer: I’m sorry I did not understand you, who would you like to speak
to?
Caller: Umm, I don’t know
Computer: I don’t recognize that name. You can say the name or position of
the person you would like to speak to.
Caller: I don’t remember.
Computer: Unfortunately I did not understand. I will connect you to customer
service.
Figure 3.5: Example of error handling in a dialogue
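The escalating behaviour in Figure 3.5 can be expressed with the count attribute on VoiceXML event handlers. The sketch below is an assumption about how such handlers might be written, not the thesis's code: the grammar file, the customer_service form and the placeholder number are hypothetical, and the prompts use the English wording from the figure.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask_person">
    <field name="person">
      <prompt>Who would you like to speak to?</prompt>
      <grammar src="names.grxml" type="application/srgs+xml"/>

      <!-- First failure: apologize, then replay the question -->
      <nomatch count="1">
        I'm sorry, I did not understand you.
        <reprompt/>
      </nomatch>

      <!-- Second failure: give more specific instructions -->
      <nomatch count="2">
        I don't recognize that name. You can say the name or position
        of the person you would like to speak to.
      </nomatch>

      <!-- Third failure: stop asking and hand over to customer service -->
      <nomatch count="3">
        Unfortunately I did not understand. I will connect you to customer service.
        <goto next="#customer_service"/>
      </nomatch>
    </field>
  </form>

  <!-- Hypothetical fallback: transfer to a placeholder customer service number -->
  <form id="customer_service">
    <transfer name="help" dest="tel:+46000000000" bridge="false"/>
  </form>
</vxml>
```

The interpreter selects the handler with the highest count that does not exceed the number of times the event has occurred, so each failure gets its own message. A matching set of noinput handlers could be added in the same way.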
Strategies that take longer but produce fewer errors and corrections are preferred
by users (Hirschberg et al., 2000). As seen in the example above, if the system is
unable to recognize an accepted answer three times in a row, it connects the caller
to customer service, who can help them. This is a simple way of handling errors
where general help is given to the user after three attempts (Gorrell, 2003). I
chose three attempts since this gives the caller three opportunities to reach
their desired person, each time with slightly more specific instructions. If they
are still unsuccessful after the third time, there is obviously a problem. More
advanced error handling techniques exist which take many aspects of the
conversation into consideration, as seen in Higgins, a dialogue system for
investigating error handling techniques (Carlson et al., 2004).
I have not implemented unique error handling for the number grammar, where
the user can respond with one of three options (mobile, home, or work phone), since
the options are listed for the user in the question. It would be unnecessary, since
the error handling would simply reprompt the user.
The number grammar and the name grammar are the only two grammars where
error handling for the user response is necessary. Error handling is also necessary
for events pertaining to the phone call, for example if the call is transferred to
a number that is busy or does not answer. This is handled in the static version by
simply stating that the person is busy or isn't answering and thanking the caller
for their call, as seen in Figure 3.6. Once the dynamics are built in, the user is
given the option of trying another number or another person.
(5) Computer: Anna Matzon svarar inte. Tack för samtalet, prova gärna igen
senare.
Computer: Anna Matzon is not answering. Thank you for your call, please try
again later.
Figure 3.6: Static event handling for an unanswered call
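A bridged transfer stores its outcome in the transfer's form item variable, which is one way the busy and no-answer cases could be told apart. The following is a hedged sketch: the placeholder number, the 20-second connect timeout and the exact prompt wording are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="xfer">
    <!-- Placeholder destination; give up if nobody answers within 20 seconds -->
    <transfer name="call" dest="tel:+46000000000" bridge="true"
              connecttimeout="20s">
      <filled>
        <!-- The variable "call" holds the outcome of the transfer attempt -->
        <if cond="call == 'busy'">
          <prompt>Anna Matzon is busy. Thank you for your call, please try again later.</prompt>
        <elseif cond="call == 'noanswer'"/>
          <prompt>Anna Matzon is not answering. Thank you for your call, please try again later.</prompt>
        </if>
        <disconnect/>
      </filled>
    </transfer>
  </form>
</vxml>
```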
To summarize, the static code is written in VoiceXML. A person calls in, asks
for a person that is in the grammar, responds with the type of number they want to
call, and is connected to a static number. If their responses are unacceptable,
special event handlers take over. If the number is busy or does not answer, the
caller is informed. This code is clearly not very powerful. The power comes when
the code becomes dynamic. In order for it to be dynamic, it needs a database to
hold all the necessary information.
3.2 Integrating Dynamics
The first step in integrating dynamics into the static code is building a functional
database. Once the database is in place, ColdFusion can be integrated with VoiceXML
to connect the database to the program.
3.2.1 Building the Database
An efficient database is necessary to build an acceptable system. Without a working
database, the system is not functional, which is why the database design is so
important and central to the entire system. The database design can be viewed in
Appendix A. It consists of five tables, which are listed below.
• Company
• Employee
• Tilltal
• InCall
• TransferCall
The Company table holds information about each company. Each company has
a unique ID which is used to separate the information in the other tables between
companies. The Employee table holds information about each individual employee,
including their telephone numbers and position at the company. Each employee has
their own unique ID, which identifies the employee in the Tilltal table as well. The
Tilltal table is the source of the grammar for all the names. Here, each name that can
be used to reach a person is registered with that employee's ID. The last two tables,
InCall and TransferCall, hold information about the calls for administrative
purposes. To test that these tables are efficient and functional, scenarios that
can arise with a caller are designed, and the way these events affect the database
is tested. A few scenarios are described below.
All the scenarios begin by a caller calling a certain telephone number which
identifies the company in the database. Knowing which company it is, the system
finds the appropriate welcome message and plays the message to the caller. After the
welcome message, the system asks who the caller wants to speak to. The caller then
responds with a name (in our example the name is Anna).
The system then searches the Tilltal table of the database, using the ID of the
company found above, for an entry with the name Anna. When it finds an entry, it
joins it to the Employee table through the employee ID, finds the filename of the
prompt containing Anna's full name, and asks the caller if he wants to speak to
Anna Matzon. If the answer is yes, the caller is connected to one of the telephone
numbers in the Employee table. If no, the system has to start from the beginning,
but this time eliminating the employee Anna Matzon as one of the options. In this
way the system can search through the names in the Tilltal table to find a
different result. This is done by excluding the previous employee's ID from the
search.
One variation of the above scenario is when a caller wants to speak to a group,
for example sales or customer service. If the caller asks for customer service then the
computer is going to find the employee that has customer service as her position. The
problem comes when the computer wants to confirm the callee with the caller. If the
computer says the callee’s actual name then the caller has no idea if it is correct or
not. An example of this can be seen in Figure 3.7. A simple solution to this problem
is that instead of simply having their names in the confirmation, the confirmation
states their position along with their full name so that if the person calling does not
know the callee’s name they will still know they are being connected to the correct
person.
(6) Computer: Välkommen till företaget. Vem vill du prata med?
Caller: Kundservice.
Computer: Vill du prata med Anna Matzon?
Caller: Jag vet inte, jag antar det.
Computer: Jag är ledsen. Jag förstod inte. Vill du prata med Anna Matzon?
Caller: OK.
Computer: Vill du bli kopplad till jobbtelefon, mobilen, eller hemtelefon?
(7) Translation in English.
Computer: Welcome to the company. Who do you want to speak to?
Caller: Customer Service
Computer: Would you like to speak to Anna Matzon?
Caller: Umm, I don't know, I guess.
Computer: I'm sorry I did not understand. Would you like to speak to Anna
Matzon?
Caller: Ok.
Computer: Would you like to be connected to workphone, mobilephone or
homephone?
Figure 3.7: Example of a possible conversation
The next scenario is how the database should handle calls that aren't connected.
A first thought is that, for calls that are busy or unanswered and aren't
automatically connected to voicemail, the system could have a message system of
its own. On closer consideration, however, the complications of a messaging system
outweigh the benefits. Since most employees are assumed to have voicemail already,
it would be very complicated work for something that in most cases already exists.
Therefore, for the few cases where voicemail does not pick up a busy or unanswered
call, the caller is asked if they would like to call another number or another
person.
After running through all the above scenarios and several others, the database
appears to be functional and effective for the receptionist's goals.
The database's interaction with the program can be seen in the ColdFusion queries
that are discussed below.
3.2.2 Using ColdFusion to Integrate Dynamics
Dynamics are integrated by writing SQL queries that pull information from the
database dynamically instead of it being hard-coded. This information is then used
by the VoiceXML code through CFOUTPUT, or VoiceXML values are passed to ColdFusion
to be used in new queries.
The process works as follows. When a call comes in, ColdFusion queries the
database to find out which company the call is meant for. Once this is known,
the information is used to fetch the appropriate welcome message and the first
question. When the caller responds with who they want to speak to, this response is
stored in a VoiceXML variable. This variable is converted to a ColdFusion session
variable, which is used to query the database and pull out the full name of the
callee.
This result is then used to confirm that it is indeed the correct employee. If it
is, it is sent to the next document, where the caller is asked which telephone
number they would like to call. The options are dynamic, depending on which numbers
the employee has in the database. Once the caller responds with a number, the
number is sent to the next document and used to transfer the caller.
If the transfer is unsuccessful in any of the ways discussed above for the static
code, the caller is sent to another document where they are asked if they would like
to try another number, if the employee has more than one, and if not, if they would
like to try another person. If they would, the whole process starts again, with the
person they have just tried to reach removed from the grammar.
3.2.3 Organizing the Code for Dynamics
In order for the dynamics to work properly, the static code needs to be separated
into several documents, for two reasons. Each part of the dialogue gets its own
document, and, since ColdFusion runs before VoiceXML on each page, VoiceXML
variables need to be sent to a new document before ColdFusion can use them. The
documents are divided as follows.
• index.cfm - the initial variables for both ColdFusion and VoiceXML are set.
• initial.cfm - the first welcome message is played and the first queries are run
to extract necessary information from the database.
• person.cfm - the first question is asked to the caller, who they’d like to speak
to.
• confirm.cfm - this document confirms the person that the system recognized
from the variable sent from person.cfm
• number.cfm - the number to call is extracted by a new question to the caller.
• xfer.cfm - the caller is transferred to the number from the previous page.
• reconnect.cfm - redirects the caller if necessary.
With these separate documents, it is easy to pass information to and retrieve it
from the database. The VoiceXML variables that need to be converted into ColdFusion
variables are passed between documents and converted on arrival. The first are
the caller's number and the number they called. These are sent from index.cfm to
initial.cfm, where several values are inserted into the database. Another VoiceXML
variable is the name that the caller said they wanted to speak to. This name is
sent from person.cfm to confirm.cfm, where it is used in a query to pull out the
full name of the person in order to confirm. The final VoiceXML variable that needs
to be converted is the number that the caller wants to be connected to. Once this
is retrieved in number.cfm it is sent to xfer.cfm.
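The thesis does not show how a variable is handed from one document to the next. One common VoiceXML mechanism is the submit element, sketched below for the step from person.cfm to confirm.cfm. The use of submit, the POST method and the grammar file name are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask_person">
    <field name="person">
      <prompt>Who would you like to speak to?</prompt>
      <grammar src="names.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Send the recognized value to the next document as a request
             parameter named "person" -->
        <submit next="confirm.cfm" namelist="person" method="post"/>
      </filled>
    </field>
  </form>
</vxml>
```

On the receiving page the value would arrive as an ordinary HTTP request parameter, which ColdFusion can read before the page's VoiceXML is generated and, for instance, copy into a session variable with CFSET.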
Except for the above conversions from VoiceXML variables to ColdFusion variables,
the ColdFusion code is simply integrated in the VoiceXML code so that VoiceXML
can use the information from the database. This includes dynamic queries,
grammars and prompts, which are discussed below.
3.2.4 Dynamic Queries and Output
CFML is a tag-based language very similar in form to VoiceXML and HTML. ColdFusion
is the medium between the database and VoiceXML. In order to make the static code
dynamic, queries that retrieve information from the database need to be included.
The most important pieces of the code are the queries and the output of these
queries. A query can be placed anywhere appropriate in the document between
CFQUERY tags. The queries are written in SQL and pull out the necessary
information from the database. They can include session variables that have already
been set, but they may not include VoiceXML variables. An example can be seen in
Figure 3.8, where the query finds the ID and name of the company whose telephone
number is the same as the session variable bnumber. These results are then
stored in session variables since they are referenced throughout the code.

<CFQUERY NAME="company_id" DATASOURCE="telefonist">
SELECT c.id as comp_id, c.name as comp_name
FROM company c
WHERE c.BaseNumber='#session.bnumber#'
AND c.Password='företag'
</CFQUERY>

<CFSET session.comp_id =company_id.comp_id>
<CFSET session.comp_name=company_id.comp_name>
Figure 3.8: Query to find company name and ID
Results from queries such as the example in Figure 3.8 cannot be used in the
VoiceXML code without the CFOUTPUT tag. This tag allows ColdFusion variables
to be placed in VoiceXML code as seen in the example in Figure 3.9.
<assign name="transfer_number"
expr="<CFOUTPUT>#query_phonenumber.jobphone#</CFOUTPUT>"/>
Figure 3.9: Example of ColdFusion output
This example is an assign command in VoiceXML which assigns the variable
transfer_number the value of expr, a CFOUTPUT expression that evaluates to
a telephone number. ColdFusion evaluates the CFOUTPUT expression, which
references a query result. The value of the query result is then placed in the
VoiceXML variable transfer_number.
The opposite is done as well where ColdFusion variables are assigned the value
of a VoiceXML variable. But, in this direction, there is an extra step. Here, it is
necessary to send the variable to a new document and then place it in a CFSET
statement.
CFOUTPUT and CFQUERY tags are used to integrate dynamics into the static
code. Two of the important dynamic parts are the dynamic grammars and dynamic
prompts which are discussed below.
3.2.5 Dynamic Grammars
Dynamic grammars hold query results from the database and are used as grammars
in the same way as static grammars except dynamic grammars change depending on
the information in the database. Grammars become dynamic by using ColdFusion.
The code for the name grammar is seen in Figure 3.10.

<CFQUERY NAME="possible_names" DATASOURCE="telefonist">
SELECT t.Name as possiblename
FROM Tilltal t, company c, employee e
WHERE c.id=#session.comp_id#
AND t.CompanyID=c.id
AND t.employeeID=e.id
AND NOT e.id = #session.first_person_Id#
</CFQUERY>

<grammar mode="voice" version="1.0">
<rule id="fullname" scope="public">
<one-of>
<CFOUTPUT QUERY="possible_names"><item>#possiblename#</item></CFOUTPUT>
</one-of>
</rule>
</grammar>
Figure 3.10: Dynamic Grammar
Instead of hard-coding each possible name into the grammar as in the static code,
ColdFusion pulls out each possible name that satisfies the query. The result is the
grammar that VoiceXML then uses to match the user input against.
An interesting aspect of this grammar is the condition in the query that the
employee's ID cannot be the same as the session variable first_person_Id. This
ensures that if the recognizer recognizes the wrong name the first time around,
that name is no longer an option when the caller is directed back to the first
question. Otherwise, the query pulls out all the names in the Tilltal table where
the company ID is the same as the session variable comp_id and the ID of the
employee matches the employee ID in the Tilltal table.
The number grammar is kept static since there are only three options. But the
question to the caller is made dynamic by asking them only about the numbers that
are available for that employee. If there is only one number for an employee, the call
is transferred immediately.
3.2.6 Dynamic Prompts
In order for the receptionist to sound natural, the speech synthesis prompts used
in the static receptionist are replaced with audio prompts. These audio prompts are
recorded individually for each company and stored in that company's directory.
The prompts need to be dynamic as well, so that the code references the appropriate
recordings. This is done by placing a CFOUTPUT tag in the reference to the audio
prompt, as seen in the example in Figure 3.11. Here it outputs the appropriate
company's name, so that the system searches in the appropriate directory for the
prompts.
<audio src=
"http://path/<CFOUTPUT>#session.comp_name#</CFOUTPUT>/welcome1.wav"/>
Figure 3.11: Dynamic Prompt
One prompt is individual to each employee, namely their name prompt. This
prompt is referenced by the filename that is stored in the Employee table in the
database and refers to a file in the appropriate company's directory. This way the
prompts are different for each company and can be recorded individually.
3.2.7 Implementing Statistical Element
Once the main part of the code is functional, the next step is to enable statistical
elements that are important for administrative purposes. Some of these are variables
such as start time, the length of the phonecall and the telephone numbers significant
to the call. This is easily accomplished by insert statements in the beginning of the
code that insert all the information that the program can get in the beginning of the
phonecall such as the start time and the telephone numbers. Then, at the end of the
code, an update statement is used to insert the duration of the call along with the
other information. This information is important so that the company that uses the
system can see call logs and analyze telephone costs.
To get an accurate time, I have to use SQL and ColdFusion functions that take
the exact time when the statement is run. The SQL time is more accurate since it
runs when the statement is run whereas the ColdFusion time runs before the actual
VoiceXML code is processed and thus can be slightly inaccurate. The start time is
not a problem since the exact time that the call starts is when the code starts running.
The more difficult part is the end time which is needed to calculate the duration of
the call. Here a SQL timer is more accurate.
Information is also logged about the transfer call. Here the duration of the call
is much easier to obtain, since a built-in VoiceXML shadow variable records the
duration of the transfer. All the information is therefore simply inserted into the
database when the transfer ends.
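In VoiceXML 2.0 the shadow variable referred to above is exposed as name$.duration on a bridged transfer, where name is the transfer's form item variable, and it gives the duration in seconds. The sketch below shows how it might be read and forwarded for logging. The logging document log.cfm and the placeholder number are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="xfer">
    <transfer name="call" dest="tel:+46000000000" bridge="true">
      <filled>
        <!-- Outcome of the transfer, e.g. busy, noanswer or far_end_disconnect -->
        <var name="outcome" expr="call"/>
        <!-- Shadow variable: duration of the transferred call in seconds -->
        <var name="duration" expr="call$.duration"/>
        <!-- log.cfm is a hypothetical page that would store the values -->
        <submit next="log.cfm" namelist="outcome duration" method="post"/>
      </filled>
    </transfer>
  </form>
</vxml>
```

Copying the shadow variable into an ordinary variable first keeps the submitted parameter name simple.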
4 Evaluation
4.1 Evaluation Method
To evaluate the receptionist, a part of the method proposed by Walker et al. (1997)
called PARAdigm for DIalogue System Evaluation (PARADISE) is used. PARA-
DISE is a method to evaluate spoken language systems where it is assumed that the
system’s main objective is to maximize user satisfaction. The PARADISE framework
derives a performance result for a dialogue system as a weighted linear combination
of task-based success and dialogue costs (Walker et al., 1997). In order to get an ap-
propriate performance rating many different measures are used such as user turns,
help requests, and recognizer rejects. These measures are weighted and combined
resulting in a performance measure of the dialogue system.
Of these measures, only task completion and user satisfaction are used in this
evaluation. To include the other measures, such as user turns and help requests, the
conversations would have had to be recorded in a controlled environment such as a
studio. I did not have access to such an environment and was therefore only able to
measure task success and, by means of a survey, user satisfaction. User satisfaction
has been used to indicate the usability of a dialogue agent, where two factors are
relevant: task success and dialogue costs (Walker et al., 1997). Since user satisfaction
is central to the success of a dialogue system, I believe that by using this part of the
PARADISE strategy I can still measure the success of the dialogue system to a certain
extent without considering the other factors in the dialogue (such as user turns or
recognizer rejects). This is done with the survey from Walker et al. (1999), which
users fill in after being given a task to complete. The survey can be viewed in
Table 4.1. The scores on the survey are totalled into a user satisfaction score that is
expressed as a percentage. The users can also write comments on different aspects of
the system, and these comments are reported below.
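The totalling of survey scores into a percentage can be sketched as follows. The thesis does not spell out the formula, so this assumes that the ratings for the seven questions (each on a 1 to 5 scale) are summed and divided by the maximum possible total, which is consistent with the percentages reported in Table 4.2. The function name and the example ratings are hypothetical.

```python
def satisfaction_percent(ratings, scale_max=5):
    """Return the summed ratings as a percentage of the maximum possible total.

    Assumed formula: 100 * sum(ratings) / (scale_max * number of questions).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= scale_max:
            raise ValueError(f"rating {rating} is outside 1..{scale_max}")
    return 100.0 * sum(ratings) / (scale_max * len(ratings))


# Made-up ratings for one user on the seven survey questions (total 30 of 35).
example_ratings = [5, 4, 4, 5, 4, 4, 4]
example_score = round(satisfaction_percent(example_ratings))
```

Under this assumption, a total of 30 out of 35 points rounds to 86%, and a total of 18 out of 35 rounds to 51%.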
(8) Call phonenumber. You want to speak to the head of economics at the
company. You do not know the name of the person, only the position.
Ring telefonnummer. Du vill prata med ekonomiansvarig på företaget. Du kan
inte namnet på personen bara den positionen.
Figure 4.1: Task example

Table 4.1: User Satisfaction Survey with Average Scores
Question   Average Score (%)
1. Var systemet lätt att förstå i detta samtal? (Was the system easy to understand?)   88%
2. Förstod systemet vad du sa? (Did the system understand what you said?)   76%
3. Var det lätt att nå fram till personen du sökte? (Was it easy to reach the person you asked for?)   73%
4. Var takten av samtalet bra för detta samtal? (Was the pace of the conversation appropriate?)   88%
5. Visste du vad du kunde säga i varje steg i dialogen? (Did you know what you could say at each step in the conversation?)   80%
6. Hur ofta var systemet sakta att svara i detta samtal? (How often was the system slow in responding?)   73%
7. Fungerade systemet som du förväntade dig i detta samtal? (Did the system behave as you expected?)   84%
Total Average Score   80%
4.2 Testing
Each user is given a task to complete. An example task is shown in Figure 4.1. After
completing the task, the user fills in a survey consisting of seven questions, each
rated on a scale of 1–5, where 1 is poor and 5 is great. This survey can be viewed in
Table 4.1. Four people have the task of reaching a person at the company knowing
only that person's position and not their name. Three people have to reach a person
whose name they know, and two people have to try to reach people who do not exist
at the company. The company consists of five people, each with at least four
references, such as their first name, their full name, and their position. Each person
at the company has one to three telephone numbers where they can be reached.
4.2.1 Test Users
The test users are nine adults aged between 25 and 35. Six males and three females
participate in the evaluation, and they have varying degrees of experience with this
type of system. A description of the users can be seen in Table 4.2. Four of the males
and one of the females have studied for a computer degree and therefore have high
computer knowledge, whereas the others are less experienced with computers. Two
of these five have studied computational linguistics and are therefore experienced
with this type of system. Some of the users have been exposed to this type of system
before, most often through the Swedish Railways train information system over the
phone. Others have no previous experience with this type of dialogue system.
4.2.2 Evaluation of Results
Of the seven people whose task is to reach a person at the company, six are successful.
The two people who try to reach people who do not work at the company reach
customer service, as does the one person who does not reach the person he is trying
to reach. The user satisfaction ratings can be seen in Table 4.2.

Table 4.2: User Satisfaction Scores
User Satisfaction   User Description
Full Name
86%   Female. Low computer knowledge. Little previous experience with dialogue systems
91%   Male. High computer knowledge. Much previous experience with dialogue systems
89%   Female. High computer knowledge. Some previous experience with dialogue systems
Position
86%   Male. High computer knowledge. Much previous experience with dialogue systems
86%   Male. High computer knowledge. Little previous experience with dialogue systems
51%   Male. Low computer knowledge. No previous experience with dialogue systems
86%   Male. Low computer knowledge. No previous experience with dialogue systems
Non-Existent Name
86%   Male. High computer knowledge. Little previous experience with dialogue systems
77%   Female. Low computer knowledge. Little previous experience with dialogue systems

The most satisfied users are those whose task is to reach a person at the company
knowing their name. The users who are given the task of trying to find a person who
does not work at the company comment that they did not know that the person did
not work at the company and that they could have been connected to customer service
more quickly. One of them thinks the error handling is good; the other thinks it is
acceptable but would have liked to be connected to customer service more quickly.
Three people would have liked more instruction at the beginning of the phone call as
to what the caller can say. Three people think that the only slow part of the
conversation is when the call is being transferred, while the others do not comment
on any delay. All but one of the users think the pace of the conversation is good; the
remaining user thinks it is too fast.
The ratings on the individual survey questions vary greatly between users, although
the overall rating for the majority is between 86% and 91%. The average scores on the
individual questions can be seen in Table 4.1. Most questions receive a score of 4 or
5 from almost all of the users. The question regarding the pace of the conversation
and the question regarding what the caller could say at each step in the dialogue
receive lower scores from a few people. The two callers who try to reach a person
who does not exist give lower ratings to the questions regarding whether the system
understood them and whether it was easy to reach the person they were calling.
One user begins by stating a whole sentence instead of a short answer when he is
asked who he wants to speak to. This causes him to receive an error message. The
user does not rephrase his request and thus quickly becomes frustrated with the
system, which results in the low score of 51%. This user does not have much computer
knowledge and has no previous experience with this type of system (see Table 4.2).
User satisfaction is quite high: seven of the nine test users rate the system between
86% and 91%. The average score for the survey is 80%, which reflects the few low
scores from some of the test users, but individually the majority of the users rate the
system above this average.

5 Designing the Web Interface
The website is designed as a complement to the receptionist, giving companies a way
to change, add, or delete information in the database and to see statistics for the
calls made to and transferred within the company. The website is built simply, using
HTML and ColdFusion, with function in mind. ColdFusion is used to query the
database for the information needed on the website, in the same way as it is for the
receptionist application. Each page is simple, with links at the top of the page to
navigate to the other pages on the site. The user logs onto the site using the
company's phone number as the user ID and the password specific to that company.
The information on the website is divided into three categories: statistics, a
homepage, and employee information. The statistics that the company is interested
in seeing are those about the calls coming into the company and the calls being
transferred within the company. This information is logged in the database in the
InCall and TransferCall tables, as discussed in the previous chapter. The statistics
page is composed of simple queries that return all of the rows from these two tables.
The query results are then output in an HTML table using CFOUTPUT. This can be
quite a long list to go through, so a summary is given on the homepage.
The homepage is the first page the company comes to when it logs on. It shows a
summary of the call statistics (the number of calls to the company and the number
of calls transferred). The summary is produced with a query similar to the one on the
statistics page, but instead of listing all the rows, it counts the number of rows
returned. This gives the number of calls into the company from the InCall table and
the number of transferred calls from the TransferCall table. These numbers are output
on the homepage, which is shown in Figure 5.1. Below the statistics, a link leads to
the statistics page in case the user wants more information about the call statistics,
as discussed above. Below that link is another link, to the employee information page.
Figure 5.1: Home Page
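The difference between the statistics page (listing rows) and the homepage summary (counting rows) can be sketched as follows. This is an illustration in Python with SQLite rather than the ColdFusion queries used on the website. The InCall and TransferCall table names come from the thesis; the columns, the company identifiers, and the sample rows are made up for the example.

```python
import sqlite3

# Table names from the thesis; restricting to them keeps the f-string query safe.
ALLOWED_TABLES = {"InCall", "TransferCall"}


def build_stats_db():
    """Create an in-memory database with a few hypothetical log rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE InCall (
            call_id INTEGER PRIMARY KEY, company TEXT, duration_s REAL);
        CREATE TABLE TransferCall (
            transfer_id INTEGER PRIMARY KEY, company TEXT, duration_s REAL);
        INSERT INTO InCall (company, duration_s) VALUES
            ('company-a', 41.0), ('company-a', 12.5), ('company-b', 30.0);
        INSERT INTO TransferCall (company, duration_s) VALUES
            ('company-a', 95.0), ('company-b', 20.0);
        """
    )
    return conn


def list_calls(conn, table, company):
    """Statistics page: return every logged row for one company."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(
        f"SELECT * FROM {table} WHERE company = ? ORDER BY 1", (company,)
    ).fetchall()


def count_calls(conn, table, company):
    """Homepage summary: count the rows instead of listing them."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    return conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE company = ?", (company,)
    ).fetchone()[0]
```

Counting in the database with COUNT(*) avoids transferring every row to the page only to count them there; whether the thesis counts in SQL or in ColdFusion is not stated.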

The third category of information is employee information. The first employee page
(see Figure 5.2) simply outputs a list of the employees (their names, telephone
numbers, and positions) using a query on the Employee table. Links to change or
delete the entry are placed next to each employee.
Figure 5.2: Employee List
If the user wants to add a new employee, the link at the end of the list takes them
to a new page with a blank form (see Figure 5.3), which they fill in with the
employee's information, including the different names by which the person can be
referenced. An insert statement then inserts the new employee's information into the
correct table.
Figure 5.3: Blank form for new employees
If they want to change information about an employee, they are taken to a similar
form that is pre-filled with the current information. Here they can change any of the
information, including all the different names listed in the Tilltal table. An update
statement is used for the Employee table, in which the new value for each field
replaces the old one. For the search names, the Tilltal table needs to be updated, and
this is done most easily by using a delete statement to remove all of the current
names for that employee and then an insert statement to add all the new names. This
avoids having to compare names: if an update statement were used, every new name
would have to be compared with the names already in the Tilltal table to ensure that
two entries do not hold the same information.
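The delete-then-insert approach for the search names can be sketched as follows, again in Python with SQLite as a stand-in for the ColdFusion and SQL used in the thesis. The Employee and Tilltal table names come from the thesis; the column names, the sample employee, and the function names are hypothetical. Wrapping both statements in one transaction is an addition of this sketch, not something the thesis describes.

```python
import sqlite3


def build_names_db():
    """Create an in-memory database with one made-up employee."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Employee (
            employee_id INTEGER PRIMARY KEY, full_name TEXT, position TEXT);
        CREATE TABLE Tilltal (employee_id INTEGER, name TEXT);
        INSERT INTO Employee VALUES (1, 'Anna Berg', 'VD');
        INSERT INTO Tilltal VALUES (1, 'Anna'), (1, 'Anna Berg'), (1, 'VD');
        """
    )
    return conn


def replace_search_names(conn, employee_id, new_names):
    """Replace all search names for one employee: delete the old, insert the new."""
    # Drop blanks and duplicates from the submitted form while keeping the order.
    cleaned = (name.strip() for name in new_names)
    unique_names = list(dict.fromkeys(name for name in cleaned if name))
    with conn:  # commits if both statements succeed, rolls back otherwise
        conn.execute("DELETE FROM Tilltal WHERE employee_id = ?", (employee_id,))
        conn.executemany(
            "INSERT INTO Tilltal (employee_id, name) VALUES (?, ?)",
            [(employee_id, name) for name in unique_names],
        )
    return unique_names
```

Because the old rows are removed first, no new name needs to be compared against what was stored before; only duplicates within the submitted form have to be filtered.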
If the user wants to delete an employee, they are taken to an intermediate step
which asks them to confirm that they really want to delete the employee. This is to
ensure that employees are not deleted by accident, since the delete function is
irreversible. If they confirm, a delete statement is used first on the Employee table
and then on the Tilltal table to remove all the information about the employee. The
user is then returned to the employee list with that employee removed. If they do not
confirm, they are returned to an unchanged employee list.
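The confirmed deletion can be sketched in the same style. This is a minimal illustration in Python with SQLite, assuming the same hypothetical columns as before. The `confirmed` flag stands in for the intermediate confirmation page, and the employees shown are made up.

```python
import sqlite3


def build_employee_db():
    """Create an in-memory database with two made-up employees."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Employee (
            employee_id INTEGER PRIMARY KEY, full_name TEXT, position TEXT);
        CREATE TABLE Tilltal (employee_id INTEGER, name TEXT);
        INSERT INTO Employee VALUES (1, 'Anna Berg', 'VD'), (2, 'Erik Lind', 'ekonomiansvarig');
        INSERT INTO Tilltal VALUES (1, 'Anna'), (1, 'VD'), (2, 'Erik'), (2, 'ekonomiansvarig');
        """
    )
    return conn


def delete_employee(conn, employee_id, confirmed):
    """Delete an employee and their search names, but only after confirmation.

    Returns True if rows were deleted and False if the user did not confirm.
    """
    if not confirmed:
        return False  # nothing is touched; the employee list stays unchanged
    with conn:  # both deletes succeed together or are rolled back together
        conn.execute("DELETE FROM Employee WHERE employee_id = ?", (employee_id,))
        conn.execute("DELETE FROM Tilltal WHERE employee_id = ?", (employee_id,))
    return True
```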

6 Concluding Remarks
In this thesis I have implemented an automatic speech-driven receptionist for Swedish
companies using VoiceXML and ColdFusion. The receptionist is designed to expect
the name or position of an employee at the company. In case the employee can be
reached at several numbers, the application asks which number it should connect to,
based on the numbers in the database. It handles unrecognized names with special
event handling so the dialogue is more pleasing to the caller. If the employee is busy
or doesn’t answer, the caller is asked if they would like to try another number if the
employee has more than one, and otherwise if they would like to try another person.
VoiceXML is used for the telephony application, in which speech recognition directs
the caller to their desired destination. Audio prompts pre-recorded by a human
speaker are used instead of speech synthesis because they are considered more
pleasing to the user.
In order for the receptionist to run dynamically, a database is designed that stores
the application content for each company. This information is retrieved using
ColdFusion. Statistics that are necessary for administrative purposes, such as call
length and the time the call started, are also inserted into the database by the
program.
A website is designed to manage the database information where companies are
able to log in, view call statistics, and change, add or delete employee information.
This site is programmed in HTML and ColdFusion.
The telephony application is tested by nine users using a user survey combined
with a task as suggested by the PARADISE framework. Seven of the nine test users
have a satisfaction score of 86% or higher.
6.1 Future Improvements
The receptionist designed and implemented in this thesis is well suited to smaller
companies, where calls are easily directed and transferred to one of the employees.
It is satisfying to the users, and the website makes it easy for the company to
manage.
Some suggestions from the evaluation could be implemented in a future version.
To make error handling clearer, the system could list the people who work at the
company, with their positions, if the caller is unsuccessful at the first attempt. This
could be desirable for smaller companies, where listing names would not be
time-consuming, and the caller would immediately know whether the person they
wish to reach works at the company. For a large company, on the other hand, it would
be too tedious. Error handling could also be improved by implementing more advanced
techniques, such as using information from the unmatched input to understand what
the caller is trying to say, instead of discarding the first response when it is not
matched, as is done here.
The receptionist has the function of directing a caller to a desired destination.
This idea could be expanded to include a messaging service in which employees leave
messages in the database for potential callers. For example, an employee could leave
a message saying that they are at lunch between 12 and 13. This would be convenient
for callers: knowing when the callee can be reached, they would not need to try
several numbers.
To make the receptionist system even more useful, another system could be added
where employees keep their contacts. In this way, the receptionist would not only
take incoming calls but would also be able to connect outgoing calls. This would
save time for the employees so they wouldn’t have to look up telephone numbers
anymore. They could simply state the name of the person they wish to call, and then
the receptionist would transfer the call.
A receptionist for smaller companies is implemented in this thesis. For large
companies, the grammars would become quite large and the risk of overlapping names
would be greater. For smaller companies, the simple system-initiated dialogue is
efficient as well as effective, as can be seen in the evaluation scores. If a larger
company wanted to use the system, it would be a good idea to make the dialogue
mixed-initiative so that the user could more easily state the purpose of the phone call.
A larger company would also benefit from a synonym builder to handle all the
different positions and names at the company. Consider what happens when a company
adds a new position, for example CEO (in Swedish, VD). Each time this happens,
someone has to insert all the synonyms manually, which is time-consuming for the
person at the company who is updating the employee information. It is unlikely that
they will have the time to reflect on all the different ways of referring to one
position. In the VD example, the company may refer to that position only as VD, but
a caller may ask for verkställande direktör or chefen, which are synonyms of VD. At
the same time, it is not feasible for the programmer to update the database manually
whenever a new position is added. Ideally, an automatic synonym builder would update
the Tilltal table for each new position. In this version, the system only handles the
positions that are included in the Tilltal table and not their synonyms. Of course, a
company may choose to insert many different variants of a position into the Tilltal
table, which is not as large a job for smaller companies.
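One very simple way such a synonym builder could work is a lookup in hand-made synonym groups, sketched below in Python. This is an assumption-laden illustration, not a design from the thesis. Only the VD group (VD, verkställande direktör, chefen) comes from the text; the data structure and the function name are hypothetical.

```python
# Hypothetical synonym groups; only this one group is taken from the thesis.
SYNONYM_GROUPS = (
    frozenset({"VD", "verkställande direktör", "chefen"}),
)


def expand_position(position, groups=SYNONYM_GROUPS):
    """Return the entered position followed by its known synonyms, sorted.

    Matching ignores case. A position without a known group is returned alone.
    """
    key = position.casefold()
    for group in groups:
        if key in {term.casefold() for term in group}:
            others = sorted(term for term in group if term.casefold() != key)
            return [position] + others
    return [position]
```

Each returned term could then be inserted as a row in the Tilltal table when a new position is saved, so that the person updating the employee information only has to enter one variant.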
In conclusion, the receptionist is designed for smaller companies to receive and
transfer phonecalls that come into the company. Some of the above improvements
could be added to the receptionist system to make it even more useful in a company
environment.

Bibliography
Kenneth R. Abbott. Voice Enabling Web Applications: VoiceXML and Beyond.
Apress, 2002.
H. Aust, M. Oerder, F. Seide, and V. Steinbiss. The Philips automatic train timetable
information system. Speech Communication, 17:249–262, 1995.
Niels Ole Bernsen, Hans Dybkjær, and Laila Dybkjær. Designing Interactive Speech
Systems: From First Ideas to User Testing. Springer-Verlag New York, Inc., 1997.
R. Carlson, J. Edlund, and G. Skantze. Higgins - a spoken dialogue system for
investigating error handling techniques. In Proceedings of ICSLP, 2004.
R. Carlson, S. Hunnicutt, and J. Gustafson. Dialogue management in the Waxholm
system. In Proceedings of Spoken Dialogue Systems, 1995.
Stephen Choularton. Handling speech recognition errors in spoken dialogue systems.
Technical report, Center for Language Technology, Macquarie University, 2004.
Workshop Paper ACL-04.
A. Danesh and K. Motlagh. Mastering ColdFusion 4.5. Sybex, 2000.
Stefan Dobler. Speech recognition technology for mobile phones. Ericsson Review,
no. 3, 2000.
Genevieve Gorrell. Recognition error handling in spoken dialogue systems. In
Proceedings of the 2nd International Conference on Mobile and Ubiquitous
Multimedia, Linköpings Electronic Conference Proceedings (www), 2003.
http://www.ep.liu.se/ecp/011/012.
H.P. Grice. Logic and conversation, 1975.
J. Gustafson, L. Bell, J. Boye, J. Edlund, J. Beskow, R. Carlson, B. Granström,
D. House, and M. Wirén. AdApt - a multimodal conversational dialogue system
in an apartment domain. In Proceedings of ICSLP '00, volume 2, pages 134–137,
2000.
Joakim Gustafson. Developing Multimodal Spoken Dialogue Systems: Empirical
Studies of Spoken Human-Computer Interaction. PhD thesis, Kungliga Tekniska
Högskolan, 2002.
Joakim Gustafson, Magnus Lundeberg, and Johan Liljecrantz. Experiences
from the development of August - a multi-modal spoken dialogue system.
http://www.speech.kth.se/august/ids99 augexp.html, 1999.
Julia Hirschberg, Marc Swerts, and Diane J. Litman. Corrections in spoken dialogue
systems. In Proceedings of the 6th International Conference of Spoken Language
Processing (ICSLP-2000), Beijing, China, October 2000.
Catrin Norrby. Samtalsanalys. Studentlitteratur, 1996.
Douglas O'Shaughnessy. Speech Communications. IEEE Press, 2000.
Bryan Pellom and Wayne Ward. The CU communicator: An architecture for dialogue
systems. In Proceedings of the 6th International Conference of Spoken Language
Processing (ICSLP-2000), 2000.
J. Shin, S. Narayanan, L. Gerber, A. Kazemzadeh, and D. Byrd. Analysis of user
behavior under error conditions in spoken dialog. ICSLP, Sep 2002.
Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C, 2003.
http://www.w3.org/TR/voicexml20.
Marilyn Walker, Diane Litman, and Candace Kamm. Evaluating spoken language
systems. In Proceedings of the American Voice Input/Output Society (AVIOS),
May 1999.
Marilyn Walker, Diane Litman, Candace Kamm, and Alicia Abella. PARADISE:
A framework for evaluating spoken dialogue agents. In Proceedings of the 35th
Annual Meeting of the Association of Computational Linguistics. ACL 97, 1997.
Wikipedia. Global positioning system. http://en.wikipedia.or/wiki/Gps.