Implementing Task-Oriented Dialogues on Turtlebot 2
Mahima Ghale, Caitlin Coggins, Rebecca Kim, Raeesa Mehjabeen
Interactive Computing Research Lab
Mount Holyoke College, Department of Computer Science
Professor Heather Pon-Barry
Text-to-speech (TTS) is a speech
synthesizer that converts text input into
speech output. Google TTS was used
because its voice output flows smoothly and
sounds the most human-like of all the
systems tried during this summer's research.
Future work for this research involves improving speech recognition by using acoustic modeling in Pocketsphinx or switching to Kaldi,
and improving input audio quality by placing the Kinect on top of the Turtlebot. Dialogues can be made more natural by finding
ways to signal to the user (using LEDs, a beep sound, etc.) when the Turtlebot is ready to listen, by using mixed-initiative interaction,
and by varying patterns in the dialogue. Localization and navigation will need to be refined by customizing the SLAM algorithm so that
the Turtlebot can recover from sudden obstacles quickly and efficiently.
Future Work
Text-to-Speech (TTS)
Kinect
Figure 4. The process of running Google TTS on the
Turtlebot
Acknowledgements
We would like to thank Professor Heather Pon-Barry for providing us with the
opportunity to work on this project, the Clare Boothe Luce Fund and Mount Holyoke
LYNK Fund for providing necessary funding, and the Computer Science Department
for constant help and support. We would also like to thank Joydeep and his team in the
AMRL at the University of Massachusetts for helping us set up the Turtlebot.
Navigation, Mapping, and Localization
For Navi to be able to go to specific rooms, it must create
a map (mapping), read that map, keep track of
its position on the map (localization), and calculate a path
to the desired destination (navigation). For this purpose,
we used a ROS package called turtlebot_navigation,
which implements SLAM (Simultaneous Localization
and Mapping).
The Kinect's 3D depth sensors detect walls and anything
else the robot considers an obstacle, and these detections
are saved as a map. During the research, several places
inside the lab were marked with room numbers for
convenience. Given a map of the environment and Navi's
initial position, the turtlebot_navigation package calculates
a path to the destination.
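The path-calculation step can be illustrated with a toy breadth-first search over a small occupancy grid. This is only a sketch of the idea; the actual planning is done inside the turtlebot_navigation package, and the grid, coordinates, and obstacle layout below are made up for illustration.

```python
from collections import deque

def plan_path(grid, start, goal):
    """Breadth-first search over a 2D occupancy grid.

    grid: list of lists, 0 = free cell, 1 = obstacle (e.g. a wall
    detected by the Kinect's depth sensors).
    start, goal: (row, col) tuples.
    Returns a list of cells from start to goal, or None if blocked.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            # Walk backwards through came_from to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None  # no route around the obstacles

# A toy map: 0 = free, 1 = wall.
lab = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
route = plan_path(lab, (0, 0), (2, 0))  # detours around the wall row
```

Real planners use costmaps and smoothed trajectories rather than four-connected grid steps, but the structure is the same: expand reachable cells from the robot's pose until the goal is found, then trace the path back.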
particular word or a group of words in a
phrase or a sentence, enables Navi to
understand the user as long as a keyword
is found in the user's utterance.
This allows the user to answer Navi's
questions freely, without having to follow a
dialogue script. The conversation was
converted into a Turtlebot-readable format
using GraphML, an XML representation of
a graph containing nodes and edges.
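The word-spotting approach can be sketched as a small keyword-to-intent lookup. The keywords and intent names below are hypothetical stand-ins; in the actual system, the edge labels of the dialogue graph play this role.

```python
import re

# Hypothetical keyword-to-intent table (illustrative, not the real
# dialogue graph's edge labels).
KEYWORDS = {
    r"\b(deliver|delivery|bring)\b": "deliver_item",
    r"\b(guide|take|escort)\b": "guide_visitor",
    r"\b(yes|yeah|sure)\b": "confirm",
    r"\b(no|nope)\b": "deny",
}

def spot_keyword(utterance):
    """Return the first intent whose keyword appears in the utterance."""
    for pattern, intent in KEYWORDS.items():
        if re.search(pattern, utterance.lower()):
            return intent
    return None  # no keyword found; Navi could re-prompt the user

print(spot_keyword("Could you guide me to room 207?"))  # guide_visitor
```

Because only the keyword has to match, the user can phrase the rest of the sentence however they like, which is what makes the dialogue feel unconstrained.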
Dialogues
Figure 2. A part of the GraphML from the Turtlebot's dialogue. The yellow
boxes above are nodes (the Turtlebot's speech) and the thin arrows with text
labels are edges (keywords from the user's speech).
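A dialogue graph in this format can be read back with a standard XML parser. The fragment below is a made-up miniature in the spirit of Figure 2, not the project's actual dialogue file; node and edge ids are invented.

```python
import xml.etree.ElementTree as ET

# A minimal, illustrative GraphML fragment: nodes would carry the
# Turtlebot's prompts, edge ids the user's keywords.
GRAPHML = """<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="dialogue" edgedefault="directed">
    <node id="ask_task"/>
    <node id="ask_room"/>
    <edge id="deliver" source="ask_task" target="ask_room"/>
  </graph>
</graphml>"""

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(GRAPHML)
nodes = [n.get("id") for n in root.findall(".//g:node", NS)]
edges = [(e.get("id"), e.get("source"), e.get("target"))
         for e in root.findall(".//g:edge", NS)]
```

Walking the edges from the current node, and matching their labels against spotted keywords, is enough to drive the conversation from state to state.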
Figure 5. Kinect, with
its labeled parts, used
for ASR, as well as for
mapping and
navigation
Abstract
Turtlebot 2 is a service robot that
should be able to perform tasks
for its users. The goal for this
summer was to enable it to deliver
items or guide a visitor to a room.
To make this possible, the main
focus was on behavior and speech
recognition, which allow users to
ask the TurtleBot for help rather
than typing instructions on a
computer.
Figure 1. Turtlebot 2,
named Navi, in the
Interactive
Computing Research
Lab (ICRL)
Figure 6. The map of the Interactive Computing Research
Lab (ICRL), created by driving Navi around through
keyboard teleoperation.
Automatic Speech Recognition (ASR)
Pocketsphinx is an open-source, speaker-independent, continuous
speech recognition engine. Although more challenging to install and use,
Pocketsphinx has much better recognition quality than Rospeex.
Users can fine-tune Pocketsphinx by creating a new dictionary, which
lists the pronunciations of the words the TurtleBot can recognize.
A grammar also makes it easier to determine which words from the
dictionary were spoken.
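A Pocketsphinx dictionary maps each word to its phones, and a JSGF grammar file constrains which word sequences the recognizer will accept. The file names, words, and phrase pattern below are illustrative, not the project's actual files:

```
;; keywords.dic -- word pronunciations (CMU Arpabet phones; illustrative)
deliver  D IH L IH V ER
guide    G AY D
room     R UW M

// commands.gram -- JSGF grammar (illustrative)
#JSGF V1.0;
grammar commands;
public <command> = ( deliver | guide ) room;
```

Restricting the search space this way is a large part of why a tuned Pocketsphinx setup recognizes task keywords more reliably than an unconstrained engine.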
Google ASR is a closed-source, online ASR system that converts audio to
text. It returns several candidate transcripts that may correspond to the
audio input, along with a confidence level.
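Given alternatives with confidence scores, the robot can pick the best transcript and fall back to re-prompting when nothing is trustworthy. The response shape and field names below are an assumption for illustration, not Google's actual API:

```python
# Response shape modeled loosely on a speech-recognition JSON reply;
# the field names here are an assumption, not Google's API.
response = {
    "alternatives": [
        {"transcript": "guide me to room two oh seven", "confidence": 0.91},
        {"transcript": "guide me to room to oh seven", "confidence": 0.64},
    ]
}

def best_transcript(resp, threshold=0.5):
    """Pick the highest-confidence transcript, or None if all are weak."""
    best = max(resp["alternatives"], key=lambda a: a["confidence"])
    return best["transcript"] if best["confidence"] >= threshold else None
```

Returning None below a threshold lets the dialogue manager ask the user to repeat themselves instead of acting on a low-confidence guess.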
Automatic speech recognition (ASR) is the process by which a computer translates a person's speech into text. Several ASR
engines were tried, including Rospeex, Pocketsphinx, and Google ASR.
Rospeex is a Robot Operating System (ROS) package. While simple to install and use, Rospeex provided the worst recognition of all
the ASR systems tested. The package is closed-source, so there is no way to improve its recognition.
Figure 3. Several scripts are needed to run Pocketsphinx.
The Kinect is a Microsoft
sensor add-on for the
Xbox gaming console. It
consists of a microphone
array, 3D depth sensors,
and an RGB camera.
A task-oriented dialogue (conversation) was developed based on the information Navi requires in
order to perform a task. To make the conversation as unconstrained as possible (meaning that users don't
have to follow an exact script to converse with Navi), word spotting and regular expressions (regex)
were adopted. Regex, which can find a