Speech recognition challenges

Presenter: Alexandru Chica

Contents

Speech User Interface basic concepts

•Speech recognition

•Speech synthesis


Speech Recognition Challenges

•Accuracy

•User responsiveness

•Performance

•Reliability

•Fault tolerance


Speech Recognition

•The translation of spoken words into written text

"#'spit&S#" (phonetic representation of speech) → algorithm → "speech"

•Statistical Processing
•Hidden Markov Models
•Dynamic Time Warping

Types of speech recognition:
•Command and control
•Dictation
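
Dynamic Time Warping, listed among the techniques above, can be sketched in a few lines. This is a minimal illustration on one-dimensional feature values; real recognizers compare multi-dimensional feature vectors. The function name and the sample sequences are assumptions made for the example, not material from the presentation.

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical feature tracks: the same "word" spoken at half speed,
# and an unrelated utterance.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 1.0, 3.0, 1.0, 3.0]

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slow utterance aligns with the template at zero cost because warping absorbs the difference in speaking rate, while the unrelated sequence keeps a positive cost.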


Speech Recognition Components

•Audio input (front-end)
•Grammars – contain commands that can be spoken by the user
•Acoustic models – language-dependent, used to “define” the language features
•Recognition algorithms (back-end)

Front end: audio input / feature extraction
Back end: grammars + acoustic models → recognition algorithms → result


Speech Recognition APIs

•Microsoft SAPI
•IBM: Embedded ViaVoice
•Nuance: VoCon
•VoiceBox Speech Recognition


Speech Synthesis

•The translation of written text into spoken words

"speech" → g2p (grapheme-to-phoneme) → "#'spit&S#"


Speech Synthesis APIs

•Microsoft SAPI
•SoftVoice TTS
•Apple PlainTalk
•Nuance: Vocalizer
•SVOX TTS
•eSpeak

Speech User Interface basic concepts - Usage

In car:
•Control media player / radio stations

•Control navigation

•Control phone book and phone activities

•Find POI locations (POI: point of interest)

•E-mail/SMS reading

On the web:
•HTML5 speech input

•Google Search with voice input

•Reading of web page content

Speech Recognition Challenges – Accuracy

Audio Input

Problem: Audio signal quality
Impact: loss of recognition accuracy

Solution 1: Echo cancellation

Solution 2: Beamforming
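
One common form of beamforming is delay-and-sum: each microphone channel is shifted so that sound from the speaker's direction lines up, then the channels are averaged. The sketch below assumes the per-channel delays (in samples) are already known; the function name and the sample signals are illustrative assumptions, not part of the presentation.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its delay (in samples) and average them."""
    # Only keep samples for which every shifted channel has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    output = []
    for t in range(length):
        aligned = [ch[t + d] for ch, d in zip(channels, delays)]
        output.append(sum(aligned) / len(channels))
    return output


# Hypothetical two-microphone setup: the second microphone hears the
# same source one sample later than the first.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mic0 = list(source)
mic1 = [0.0] + source

steered = delay_and_sum([mic0, mic1], [0, 1])    # delays match the source
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # no alignment
print(steered)
print(unsteered)
```

With the correct delays the source adds up coherently, while sound arriving from other directions (and uncorrelated noise) is attenuated by the averaging.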

Speech Recognition Challenges – Accuracy

Audio Input

Problem: Talk-over (the user speaks while the TTS prompt is still playing)
Impact: loss of recognition accuracy

Solution: Barge-In

Figure: TTS prompt and user speech timelines
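
Barge-in can be sketched as a loop that plays the prompt frame by frame and stops as soon as the microphone shows speech. The example assumes a simple energy threshold as the speech detector and assumes the microphone frames have already had the prompt echo removed; the function names, the threshold and the frame data are all hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(x * x for x in frame) / len(frame)


def play_prompt_with_barge_in(prompt_frames, mic_frames, threshold=0.01):
    """Play the prompt until the user speaks.

    Returns (frames_played, barged_in).
    """
    played = 0
    for _prompt, mic in zip(prompt_frames, mic_frames):
        if frame_energy(mic) > threshold:
            # User started talking: stop TTS and hand over to recognition.
            return played, True
        played += 1  # stands in for sending the frame to the speaker
    return played, False


silence = [0.0, 0.0, 0.0, 0.0]
speech = [0.5, -0.5, 0.5, -0.5]
prompt = [silence] * 5                      # placeholder prompt audio
mic = [silence, silence, silence, speech, speech]

print(play_prompt_with_barge_in(prompt, mic))
```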

Speech Recognition Challenges – User responsiveness

Speech Recognition

Problem: the user starts speaking the command before resources are ready
Solution: Delayed speech recognition

Resource loading / Back-end processing
Front-end processing

Delayed Speech Recognition
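
One way to read delayed speech recognition: the front end keeps buffering audio while the back end is still loading, and recognition runs over the buffered audio once loading completes. The sketch below assumes that reading; the class name, the sleep that stands in for resource loading, and the frame labels are hypothetical.

```python
import queue
import threading
import time


class DelayedRecognizer:
    """Buffers front-end audio until back-end resources are loaded."""

    def __init__(self, load_seconds=0.05):
        self._frames = queue.Queue()
        self._ready = threading.Event()
        self._loader = threading.Thread(target=self._load, args=(load_seconds,))
        self._loader.start()

    def _load(self, seconds):
        time.sleep(seconds)  # stands in for loading grammars and models
        self._ready.set()

    def push_audio(self, frame):
        # The front end never waits for the back end.
        self._frames.put(frame)

    def recognize(self):
        # Block until resources are ready, then drain the buffered audio.
        self._ready.wait()
        frames = []
        while True:
            try:
                frames.append(self._frames.get_nowait())
            except queue.Empty:
                return frames


recognizer = DelayedRecognizer()
for frame in ("frame-1", "frame-2", "frame-3"):
    recognizer.push_audio(frame)   # user speaks while loading is in progress
buffered = recognizer.recognize()  # nothing spoken early was lost
print(buffered)
```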

Speech Recognition Challenges – User responsiveness

Speech Recognition

Problem: synchronization with multiple applications (media, phone, navigation)

Solution: apply concurrency design patterns

•Active Object

•Monitor

•Double-checked locking
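
Of the patterns listed, Active Object is sketched below: callers from any application enqueue requests, and a single worker thread executes them in order, so the speech state needs no extra locking. The class and method names and the `SpeechSession` state object are assumptions for illustration only.

```python
import queue
import threading
from concurrent.futures import Future


class ActiveObject:
    """Runs submitted calls one at a time on a private worker thread."""

    def __init__(self):
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:          # shutdown marker
                break
            func, args, future = item
            try:
                future.set_result(func(*args))
            except Exception as exc:  # report failures to the caller
                future.set_exception(exc)

    def submit(self, func, *args):
        future = Future()
        self._requests.put((func, args, future))
        return future

    def shutdown(self):
        self._requests.put(None)
        self._worker.join()


class SpeechSession:
    """Hypothetical shared state used by media, phone and navigation."""

    def __init__(self):
        self.events = []

    def handle(self, source, event):
        self.events.append((source, event))
        return len(self.events)


session = SpeechSession()
speech = ActiveObject()
futures = [speech.submit(session.handle, source, "request")
           for source in ("media", "phone", "navigation")]
counts = [f.result() for f in futures]
speech.shutdown()
print(counts)
```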

Speech Recognition Challenges – Performance

Grammars

Use cases:

• Command & Control grammars
• 200 – 500 commands

•Navigation grammars
• 100k+ static data

•Music grammars
• 10k+ dynamic data


Grammars (1)

Problem: Grammar size is too large
Impact:
• increased loading times of files from disk to memory

Solution: Grammar optimization
•merging of similar command tokens
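
One possible reading of merging similar command tokens is to share common leading words between commands, so that each shared token is stored once. The sketch below builds such a prefix tree and compares token counts; the commands and the function names are hypothetical, and real grammar compilers may merge tokens differently.

```python
END = "<end>"  # marks that a complete command ends at this node


def build_prefix_tree(commands):
    """Merge commands that start with the same tokens into one tree."""
    tree = {}
    for command in commands:
        node = tree
        for token in command.split():
            node = node.setdefault(token, {})
        node[END] = {}
    return tree


def count_tokens(tree):
    """Count token nodes in the tree, ignoring end markers."""
    return sum(1 + count_tokens(child)
               for key, child in tree.items() if key != END)


commands = ["play artist", "play album", "play title",
            "call contact", "call number"]
flat_tokens = sum(len(command.split()) for command in commands)
merged_tokens = count_tokens(build_prefix_tree(commands))
print(flat_tokens, merged_tokens)
```

In this example the shared "play" and "call" tokens are stored once each, so the merged grammar holds fewer tokens than the flat command list.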


Grammars (2)

•removal / replacement of recursion rules


Grammars (3)

Problem: Grammar token collisions
Impact:
• loss of recognition accuracy
Solution:
•replacement of collision prone tokens with synonyms
•adding special pronunciation tokens to collision words

Examples:

sum – sun – sung

bet – bed
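
Collision-prone tokens like these can be flagged automatically before a grammar ships. The sketch below uses edit distance on spelling as a rough stand-in for phonetic similarity; a real check would compare phoneme transcriptions. The function names, the threshold and the extra token "navigate" are assumptions.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def collision_pairs(tokens, max_distance=1):
    """Return token pairs that are close enough to be confused."""
    return [(a, b) for a, b in combinations(tokens, 2)
            if edit_distance(a, b) <= max_distance]


tokens = ["sum", "sun", "sung", "bet", "bed", "navigate"]
pairs = collision_pairs(tokens)
print(pairs)
```

Pairs reported this way are candidates for the two fixes above: replacing one token with a synonym, or giving it a special pronunciation entry.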


Dynamic Grammars

Problem: synchronization with USB devices, phones, navigation databases takes too much time

Solution 1: implementation of a caching mechanism

Use id3 parser to read titles, artists, composers, genre, album, etc. from mp3 files

Title: One
Artist: U2
Album: Achtung Baby
Genre: rock
...

Phoneme cache → transcriptions → dynamic grammar

add to slot:
title <DYN_TITLE>
artist <DYN_ARTIST>
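
The caching idea can be sketched as a lookup table from text to phonetic transcription, so that a device seen before does not trigger transcription again. Everything named here is an assumption for illustration: `fake_g2p` is a placeholder for a real grapheme-to-phoneme engine, and `PhonemeCache` and `fill_slots` are hypothetical helpers.

```python
def fake_g2p(text):
    """Placeholder for a real grapheme-to-phoneme engine (assumption)."""
    return "-".join(text.replace(" ", ""))


class PhonemeCache:
    """Remembers transcriptions so each string is transcribed only once."""

    def __init__(self, transcribe):
        self._transcribe = transcribe
        self._cache = {}
        self.misses = 0

    def get(self, text):
        key = text.strip().lower()
        if key not in self._cache:
            self.misses += 1  # the expensive step happens only on a miss
            self._cache[key] = self._transcribe(key)
        return self._cache[key]


def fill_slots(tracks, cache):
    """Build the dynamic grammar slots from track metadata."""
    slots = {"<DYN_TITLE>": [], "<DYN_ARTIST>": []}
    for track in tracks:
        slots["<DYN_TITLE>"].append(cache.get(track["title"]))
        slots["<DYN_ARTIST>"].append(cache.get(track["artist"]))
    return slots


tracks = [{"title": "One", "artist": "U2"}]
cache = PhonemeCache(fake_g2p)
first_sync = fill_slots(tracks, cache)
second_sync = fill_slots(tracks, cache)  # same device plugged in again
print(cache.misses)
```

The second synchronization is served entirely from the cache, which is where the time saving comes from.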


Dynamic Grammars

Solution 2: split the processing in two, and dispatch part of the work to a different processor
CPU1: use id3 parser to read titles, artists, composers, genre, album, etc. from mp3 files

Title: One
Artist: U2
Album: Achtung Baby
Genre: rock
...

CPU1 / CPU2: preprocessing step

CPU2: dynamic grammar, add to slot:
title <DYN_TITLE>
artist <DYN_ARTIST>
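
The split can be pictured as a two-stage pipeline connected by a queue: one stage reads metadata, the other prepares grammar entries, and both run at the same time. In the sketch below, two threads stand in for the two processors (on real hardware this would be separate cores or chips); the stage functions and the `"title|artist"` input format are hypothetical.

```python
import queue
import threading


def run_pipeline(items, stage_one, stage_two):
    """Run stage_one and stage_two concurrently, linked by a queue."""
    handoff = queue.Queue()
    results = []

    def producer():                      # plays the role of CPU1
        for item in items:
            handoff.put(stage_one(item))
        handoff.put(None)                # signals end of data

    def consumer():                      # plays the role of CPU2
        while True:
            item = handoff.get()
            if item is None:
                break
            results.append(stage_two(item))

    workers = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


def parse_metadata(raw):
    """Stand-in for ID3 parsing of one file."""
    title, artist = raw.split("|")
    return {"title": title, "artist": artist}


def to_slot_entries(meta):
    """Stand-in for preparing dynamic grammar slot entries."""
    return [("<DYN_TITLE>", meta["title"].lower()),
            ("<DYN_ARTIST>", meta["artist"].lower())]


entries = run_pipeline(["One|U2"], parse_metadata, to_slot_entries)
print(entries)
```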

Speech Recognition Challenges – Reliability

Reliability – the ability of the system to keep operating over time

Problem: the system has to operate correctly over long periods of time

Solution 1: automated tests

Solution 2: drive tests

Speech Recognition Challenges – Fault tolerance

Problem: Recovery from system failures must be possible

Solution:

• the system is designed in a modular manner, with components that communicate via the car's internal area network

• individual components can be restarted without affecting other system components
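
The restart behaviour can be sketched as a supervisor that checks each component and restarts only the ones that have failed. The `Component` and `Supervisor` classes and the component names are hypothetical; a real system would detect failures through heartbeats or process monitoring rather than a flag.

```python
class Component:
    """Hypothetical system component with a simple health flag."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.restarts = 0

    def restart(self):
        self.alive = True
        self.restarts += 1


class Supervisor:
    """Restarts failed components and leaves healthy ones untouched."""

    def __init__(self, components):
        self.components = {c.name: c for c in components}

    def check(self):
        restarted = []
        for component in self.components.values():
            if not component.alive:
                component.restart()
                restarted.append(component.name)
        return restarted


asr = Component("asr")
tts = Component("tts")
supervisor = Supervisor([asr, tts])

asr.alive = False                  # simulate a recognizer crash
restarted = supervisor.check()
print(restarted, tts.restarts)
```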


TTS & ASR Demo


Questions?


Thank You
