This document summarizes research on improving speech recognition for contextualized service robots. It proposes four approaches: 1) using contextual language models specific to the robot's current expectations, 2) using prompting beeps to signal when to speak, 3) implementing recovery strategies such as asking the user to repeat, and 4) calibrating the audio settings against background noise. Contextualized language models reached a word error rate of 23.42%, a 17.2% relative reduction compared to task-based models (28.28%) and well below a single model for all tasks (53.84%). Beeps reduced error by between 4% and 30% in relative terms, and recovery strategies were triggered 16.87% of the time. Together these measures allow more practical speech recognition for service robots.
Micai 13 contextualized practical speech
1. Departamento de Ciencias de la Computación
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas
Universidad Nacional Autónoma de México
Practical Speech Recognition for
Contextualized Service Robots
Ivan Meza, Caleb Rascón and Luis Pineda
http://golem.iimas.unam.mx/
GrupoGolem
2. Service robots
● Our future butlers
● They are task oriented
○ Clean up a room
○ Play a game
● Interaction with spoken language
● They work in noisy environments
● Microphone is not close to the speaker
● Poor speech recognition
3. Proposal
● Improve the system on four aspects
● Contextualized recogniser
● Prompting strategies
● Recovery strategies
● Audio calibration
4. I. Contextualized recognition
● Use specific language models for the
given expectations
■ YES: yes, okay, all right
■ NO: no, don’t, do not
■ NAVIGATE: go to the kitchen, go to the living
room, go to the bedroom
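The idea of one language model per expectation can be sketched as a lookup from the robot's current expectations to the phrases it should accept. This is a minimal illustration only: plain phrase lists stand in for real language models, and the names `LANGUAGE_MODELS`, `active_phrases`, and `interpret` are hypothetical, not taken from the system described here.

```python
# Hypothetical sketch: one small phrase list per expectation, standing in
# for the per-context language models listed on the slide.
LANGUAGE_MODELS = {
    "YES": ["yes", "okay", "all right"],
    "NO": ["no", "don't", "do not"],
    "NAVIGATE": [
        "go to the kitchen",
        "go to the living room",
        "go to the bedroom",
    ],
}


def active_phrases(expectations):
    """Collect the phrases the recogniser should accept in this context."""
    phrases = []
    for name in expectations:
        phrases.extend(LANGUAGE_MODELS[name])
    return phrases


def interpret(hypothesis, expectations):
    """Return the expectation a recognised string satisfies, or None."""
    text = hypothesis.strip().lower()
    for name in expectations:
        if text in LANGUAGE_MODELS[name]:
            return name
    return None
```

When the robot has just asked a yes/no question, only the YES and NO models would be active, so `interpret("Okay", ["YES", "NO"])` yields `"YES"` while a navigation command falls outside the context and yields `None`.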
6. II. Prompting strategies
● Let the user know when to speak
■ Beep sound
● Speaker volume monitor
■ Could you speak louder or softer?
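A speaker volume monitor of this kind could be approximated by measuring the RMS level of the captured audio and comparing it against two thresholds. The sketch below is an assumption about how such a check might look; the threshold values and the function names `rms_dbfs` and `volume_prompt` are hypothetical and are not given in the slides.

```python
import math

# Hypothetical thresholds in dB relative to full scale (dBFS);
# the slides do not state the values used on the robot.
TOO_SOFT_DB = -45.0
TOO_LOUD_DB = -10.0


def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], expressed in dBFS."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def volume_prompt(samples):
    """Return a prompt for the user, or None if the level is acceptable."""
    level = rms_dbfs(samples)
    if level < TOO_SOFT_DB:
        return "Could you speak louder?"
    if level > TOO_LOUD_DB:
        return "Could you speak softer?"
    return None
```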
7. III. Recovery strategy
● Let the user know when something
went wrong
■ Could you repeat?
■ I can’t hear you well, could you repeat?
■ Sorry, I’m a little deaf
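A recovery strategy along these lines can be sketched as a retry loop around the recogniser. The slides do not say what triggers recovery, so this sketch assumes it fires whenever the hypothesis is not among the currently accepted phrases; `listen_with_recovery`, `recognise`, `say`, and `max_attempts` are hypothetical names introduced for illustration.

```python
import itertools

# Prompts taken from the slide; the order of use is an assumption.
RECOVERY_PROMPTS = [
    "Could you repeat?",
    "I can't hear you well, could you repeat?",
    "Sorry, I'm a little deaf.",
]


def listen_with_recovery(recognise, accepted, say, max_attempts=3):
    """Call recognise() until it returns an accepted phrase.

    recognise: callable returning the recogniser's hypothesis (a string).
    accepted:  collection of phrases valid in the current context.
    say:       callable that speaks a prompt to the user.
    Returns the accepted hypothesis, or None after max_attempts failures.
    """
    prompts = itertools.cycle(RECOVERY_PROMPTS)
    for _ in range(max_attempts):
        hypothesis = recognise()
        if hypothesis in accepted:
            return hypothesis
        # Assumed trigger: the hypothesis does not match any expectation.
        say(next(prompts))
    return None
```

Passing `recognise` and `say` as callables keeps the loop independent of any particular recogniser or speech synthesiser.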
8. IV. Calibration of audio setting
● Hardware
■ 1 directional microphone
■ 1 USB interface with 4 channels
■ 2 speakers
● Calibration of SNR in situ
■ For background noise of -58 dB
■ SNR set to 20 dB
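The calibration figures imply a simple piece of decibel arithmetic. Assuming the SNR is the speech level minus the noise floor (both in dB), a -58 dB background and a 20 dB SNR put the target speech level at -38 dB. The function names below are hypothetical, and the slides do not describe the actual calibration procedure.

```python
def target_speech_level_db(noise_floor_db, snr_db):
    """Speech level needed to reach the desired SNR over the noise floor."""
    return noise_floor_db + snr_db


def gain_adjustment_db(measured_speech_db, noise_floor_db, snr_db):
    """Gain change (in dB) that moves measured speech to the target level."""
    return target_speech_level_db(noise_floor_db, snr_db) - measured_speech_db


# Values from the slide: -58 dB background noise, SNR set to 20 dB.
print(target_speech_level_db(-58.0, 20.0))  # -38.0
```

For example, if speech were measured at -44 dB in the same room (an invented figure), `gain_adjustment_db(-44.0, -58.0, 20.0)` would suggest raising the input gain by 6 dB.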
9. Corpus evaluation
● Logs from the robot performing
RoboCup tasks
■ 2 years of interactions in lab and competition
■ 1,439 utterances
■ 2,472 tokens
■ 120 types
■ 11 tasks
■ 9 of 11 tasks are contextualized
■ 14 language models
10. Contextualized recognition
We measure WER (the lower the better)
● With a unique LM for all tasks: 53.84%
● With task-based LM: 28.28%
● With contextualized: 23.42%
17.2% relative error reduction (task-based vs. contextualized)
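For reference, WER is conventionally the word-level edit distance between the reference transcript and the recogniser's hypothesis, divided by the number of reference words. The sketch below uses that standard definition and then recomputes the relative reduction from the figures above; it is not the evaluation code used for these results, and the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# Task-based LM (28.28%) versus contextualized LMs (23.42%).
print(round(relative_reduction(28.28, 23.42), 1))  # 17.2
```

The 17.2% figure corresponds to going from the task-based models to the contextualized ones. Measured against the unique language model (53.84%), the same function gives a relative reduction of about 56.5%.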
11. Beep sound
● 79 utterances were recorded without the
beep sound
■ Without beeps 55.86%
■ With beeps 39.75%
■ With beeps full 53.72%
Between 4% and 30% relative error reduction
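The two ends of this range can be checked from the WER figures above, assuming the "without beeps" condition (55.86%) is the baseline for both comparisons. The function name is hypothetical. The computed values come out slightly below the quoted 30% and 4%, which suggests the slide rounds them.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


WITHOUT_BEEPS = 55.86    # WER without the beep sound
WITH_BEEPS = 39.75       # WER with beeps
WITH_BEEPS_FULL = 53.72  # WER for the "with beeps full" condition

print(round(relative_reduction(WITHOUT_BEEPS, WITH_BEEPS), 1))       # 28.8
print(round(relative_reduction(WITHOUT_BEEPS, WITH_BEEPS_FULL), 1))  # 3.8
```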
12. Usage of SoundLoc System
● We measure usage
■ 174 times could have been triggered
■ 21 soft speech
■ 4 louder
14.36% of the time
13. Recovery strategy
● We measure usage
■ 504 times could have been triggered
■ 85 times activated
16.87% of the time
14. Conclusions
● Each strategy improves performance by a
small amount
● Together they allow practical speech
recognition on a service robot