Speech Recognition on embedded devices

Louis-Marie Aubert
ECIT – Queen’s University Belfast

    DevDays – Belfas...

Dev Days, Speech Recognition, LM Aubert


Overview of Automatic Speech Recognition (ASR) for embedded devices
- Large vocabulary, continuous speech recognition
- Technical overview
- Potential applications
- Upcoming alternatives to embedded engines
Presented at DevDays, Belfast, UK, 24 April 2009
Louis-Marie Aubert, ECIT, Queen's University Belfast


  1. Speech Recognition on embedded devices
     Louis-Marie Aubert, ECIT – Queen’s University Belfast
     DevDays – Belfast – April 24, 2009
  2. What should we expect from speech recognition?
  3. Speech Recognition success?
     • Natural continuous speech
     • Real-time
     • Large vocabulary (up to 100,000 words)
     • No training (speaker independent)
     • Adaptive to speaker accent
     • Robust against
       – Background noise
       – Audio frontend imperfections
     • N-best hypotheses with confidence value
  4. What are the solutions on the market?
  5. Existing solutions
     • Server-based
       – Telephony, IVR
       – Dictation (health care industry)
       – Audio indexing
     Either offline or with significant delays
  6. Existing solutions
     • Desktop-based
       – Real-time dictation
       – Language learning
     Requires a good setup, a powerful computer and a quiet environment
     Very good accuracy, no training required
  7. Existing solutions
     • Embedded applications
       – Simple voice commands (‘Call-mum’ type command)
       – Disconnected word recognition
     Small vocabulary and lack of naturalness restrict the range of applications
  8. Is it so difficult?
  9. Technical challenge
     [Diagram: Speech waveform → Speech Recognizer → Transcription (‘Hello world’)]
  10. Technical challenge
      [Diagram: Speech waveform → Spectral Analyser → Acoustic feature vectors (~40 coefficients every 10 ms)]
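The 10 ms frame rate on slide 10 can be illustrated with a minimal Python sketch. The 16 kHz sample rate, the 25 ms analysis window and the function names are assumptions, not from the slides. A single log-energy value stands in for the ~40 spectral coefficients that a real front end would compute per frame.

```python
import math

def frame_signal(samples, sample_rate=16000, hop_ms=10, win_ms=25):
    """Cut a waveform into overlapping windows, one window every hop_ms."""
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz (assumed rate)
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz (assumed window)
    return [samples[start:start + win]
            for start in range(0, len(samples) - win + 1, hop)]

def log_energy(frame):
    """One toy feature per frame; a real analyser would output ~40 values."""
    return math.log(sum(x * x for x in frame) + 1e-10)

# One second of a 440 Hz tone as a stand-in for speech.
SAMPLE_RATE = 16000
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

frames = frame_signal(signal, SAMPLE_RATE)
features = [log_energy(f) for f in frames]
print(len(frames), "frames for one second of audio")
```

Roughly one hundred feature vectors per second reach the recognizer, which is why every later stage has a 10 ms budget per frame.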
  11. Technical challenge
      [Diagram: Acoustic feature vectors → Recognizer (senone calculation → Viterbi decoding) → Transcription (‘Hello world’). Knowledge sources: Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model]
  13. Technical challenge: Acoustic Models
      • 4000 acoustic models
      • Sub-acoustic units
      • Functions that score 10 ms of speech
      • Sets of 40-long mean and variance vectors of Gaussian mixtures (16)
      [Diagram: Acoustic feature vectors → Recognizer (multi-dim. Gaussian mixture calculation → Viterbi decoding) → Transcription (‘Hello world’). Knowledge sources: Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model]
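A minimal Python sketch of how one acoustic model could score a single 10 ms feature vector. It assumes diagonal-covariance Gaussian mixtures and reads the "(16)" on the slide as 16 components per mixture; both are assumptions, and the function names are illustrative. The last lines work out the parameter count implied by the slide's figures.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    comps = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))

# Model size implied by the slide: 4000 models, 16 components (assumed reading),
# 40-dimensional mean and variance vectors.
N_MODELS, N_COMPONENTS, N_DIMS = 4000, 16, 40
N_PARAMS = N_MODELS * N_COMPONENTS * N_DIMS * 2   # means + variances
print(N_PARAMS, "mean/variance values, before mixture weights")
```

Under these assumptions the means and variances alone come to about 5 million values, and every active model has to be evaluated once per 10 ms frame.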
  16. Technical challenge: Phoneme
      • 50 in English
      • Distinguishable sounds
      • Each represents a sequence of senones, modelled as an HMM (Hidden Markov Model)
        – ‘ah’: ah1 ah2 ah3
        – ‘l’: l1 l2 l3
      [Diagram: recognizer pipeline with knowledge sources Senone Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model]
  17. Technical challenge: Triphone
      • 2500 in English
      • Distinguishable sounds in their context (continuous speech)
        – ‘hh-ah+l’: ah1 ah2 ah3
        – ‘ah-l+ow’: l1 l2 l3
      [Diagram: recognizer pipeline with knowledge sources Senone Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model]
  18. Technical challenge
      [Diagram: Acoustic feature vectors → Recognizer (senone calculation → Viterbi decoding) → Transcription (‘Hello world’). Knowledge sources: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model]
  20. Technical challenge: Word
      • Large vocabulary: 64000
      • Each represents a sequence of phonemes/triphones
        – ‘hello’: hh ah l ow
        – ‘world’: w er l d
      [Diagram: recognizer pipeline with knowledge sources Senone Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model]
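The chain from word to phonemes to triphones to senone states can be sketched in a few lines of Python. The `left-centre+right` notation and the ‘hello’ pronunciation come from the slides. The `sil` boundary context, the three states per unit and the helper names are assumptions for illustration, and naming states after the centre phoneme only mirrors the slide's simplified labels.

```python
# Toy word lexicon with the pronunciation shown on the slide.
LEXICON = {"hello": ["hh", "ah", "l", "ow"]}

def to_triphones(phones, boundary="sil"):
    """Attach left and right context to every phoneme ('sil' at word edges is assumed)."""
    padded = [boundary] + phones + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

def to_states(triphone, n_states=3):
    """Name the HMM states of a triphone after its centre phoneme, as on the slide."""
    centre = triphone.split("-")[1].split("+")[0]
    return [f"{centre}{k}" for k in range(1, n_states + 1)]

triphones = to_triphones(LEXICON["hello"])
for tri in triphones:
    print(tri, "->", " ".join(to_states(tri)))
```

With 64000 words expanded this way and linked by the language model, the search network reaches the tens of millions of states quoted later in the deck.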
  23. Technical challenge: Statistical language model
      • Bi-gram / Tri-gram
      • Gives the probability of a sequence of 2/3 words
      • 64000 words lead to roughly 10 million states / 50 million arcs
      • Example: after ‘hello’, ‘mum’ has probability 0.3, ‘dad’ 0.2, ‘world’ 0.05
      [Diagram: recognizer pipeline with knowledge sources Senone Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model]
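A minimal Python sketch of a bi-gram lookup using the three probabilities from the slide. The floor value for unseen word pairs and the function names are assumptions; a real model would use proper smoothing or back-off.

```python
import math

# P(next word | previous word), values taken from the slide's example.
BIGRAMS = {
    ("hello", "mum"): 0.3,
    ("hello", "dad"): 0.2,
    ("hello", "world"): 0.05,
}

def sentence_log_prob(words, bigrams, floor=1e-6):
    """Sum of log bi-gram probabilities; unseen pairs get an assumed floor."""
    return sum(math.log(bigrams.get(pair, floor))
               for pair in zip(words, words[1:]))

def most_likely_next(word, bigrams):
    """Best follower of `word` among the stored bi-grams, or None."""
    followers = {nxt: p for (prev, nxt), p in bigrams.items() if prev == word}
    return max(followers, key=followers.get) if followers else None

print(most_likely_next("hello", BIGRAMS))
print(math.exp(sentence_log_prob(["hello", "world"], BIGRAMS)))

# A 64000-word vocabulary allows 64000**2 distinct word pairs,
# so only pairs actually observed in training text are stored.
POSSIBLE_BIGRAMS = 64000 ** 2
```

The language model alone would prefer ‘hello mum’; ‘hello world’ only wins when the acoustic scores favour it strongly enough.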
  25. Technical challenge
      [Diagram: Acoustic feature vectors → Recognizer (senone calculation → Viterbi decoding) → Transcription (‘Hello world’). Knowledge sources: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model]
      ~25 million states / 250 million arcs
  27. Technical challenge: Viterbi decoding
      • Token passing algorithm
      • 5000/10000 tokens to propagate every 10 ms
      • Select the most promising tokens and output the associated sequence of:
        – senones
        – triphones
        – words
        – sentence
      [Diagram: network of senone states (l1 l2 l3, ow1 ow2 ow3, s1 s2 s3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3) searched by the Viterbi decoder, with knowledge sources Senone Lexicon, Triphone Lexicon, Word Lexicon, Statistical Language Model]
      ~25 million states / 250 million arcs
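A minimal Python sketch of token passing with pruning, on a toy network of two competing one-syllable words. The network, the acoustic scores and all names here are invented for illustration; only the overall scheme (propagate tokens every frame, keep the best token per state, prune to a fixed number of tokens) follows the slide.

```python
import math

def token_passing(frames, arcs, start, finals, max_tokens=5000):
    """Propagate tokens frame by frame; return (log score, state path) or None."""
    tokens = {start: (0.0, [start])}          # state -> (log score, path so far)
    for scores in frames:                     # one dict of acoustic log scores per 10 ms
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, log_p in arcs.get(state, []):
                cand = score + log_p + scores[nxt]
                # Viterbi step: keep only the best token arriving in each state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        # Pruning: keep only the most promising tokens.
        ranked = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:max_tokens])
    finished = [tok for state, tok in tokens.items() if state in finals]
    return max(finished, key=lambda tok: tok[0]) if finished else None

LOG_HALF = math.log(0.5)
ARCS = {  # toy network: the words 'lo' and 'so', each with a self-loop per state
    "start":  [("lo/l1", LOG_HALF), ("so/s1", LOG_HALF)],
    "lo/l1":  [("lo/l1", LOG_HALF), ("lo/ow1", LOG_HALF)],
    "so/s1":  [("so/s1", LOG_HALF), ("so/ow1", LOG_HALF)],
    "lo/ow1": [("lo/ow1", LOG_HALF)],
    "so/ow1": [("so/ow1", LOG_HALF)],
}
FRAMES = [  # invented acoustic log scores for three frames
    {"lo/l1": -1, "so/s1": -3, "lo/ow1": -5, "so/ow1": -5},
    {"lo/l1": -1, "so/s1": -3, "lo/ow1": -2, "so/ow1": -2},
    {"lo/l1": -4, "so/s1": -4, "lo/ow1": -1, "so/ow1": -1},
]

best_score, best_path = token_passing(FRAMES, ARCS, "start", {"lo/ow1", "so/ow1"})
print(best_path, best_score)
```

The dictionary of tokens plays the role of the 5000 to 10000 live tokens on the slide. In this toy case, lowering `max_tokens` to 2 still recovers the same path, which is the bet that pruning makes on the full network.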
  29. Challenges in embedded systems
      • Low computational resources
      • Power consumption constraints
      • Noisy environment, poor audio quality
      For a truly embedded speech recognition engine that works, we must move away from the pure software approach:
      • Make the best of all hardware acceleration available
      • Dedicated chip (accelerator) to offload the CPU and relax memory constraints
  30. Why do we want speech recognition on embedded devices anyway?
  31. Applications on mobiles
      • Complement touch screen interface with speech interface
      • Speech-enable existing mobile applications
        – Browse complex menus
        – Easily find items in large libraries, local or online (contacts, music…)
        – Browse Web and search maps
        – Games
        – Compose text-messages, emails…
  32. Applications on mobiles
      • Speech-enable mobile applications
      Source: Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008
  33. Applications on mobiles
      • Key to safety when driving
        – Text-messaging
        – Satellite-Navigation function
      • Voice Memo
        – Shopping list
        – Activity scheduler
      • Market of Speech technology in embedded devices
        – $125 million in 2006
        – $500 million in 2010
      Source: Opus Research report, March 2007
  34. Other markets
      • Developing countries
        – Access to information technology for illiterate people
          • Administrative tasks
          • Education
          • Social integration
      • Health-care at home (self-manage diseases)
        – Exploding market
          • Chronic diseases
          • Elderly people (Baby Boomers reach retirement age)
          • Market for home health care products is valued at $4.3 billion today
        – Place for Speech recognition
          • Inexperience of patients with electronic interfaces
          • Poor physical condition (e.g. low vision)
          • Illiteracy
      Source: Medical Device Today, March 2009
  35. Other applications
      • Speech translation
        – IraqCom
  36. Okay, I can’t wait! Is there anything I can use now?
  37. Upcoming solutions
      • Voicemail accessible via text-message, email or dedicated application
        – Server-based
        – Requires agreement and implementation by the carriers
  38. Upcoming solutions
      • Nuance Voice Control 2
        – Online search
        – Text-messaging
          • Embedded software for simple voice commands
          • Server-based engine for large vocabulary speech recognition
      • Speech Recognition API on Android 1.5
  39. So?
  40. Conclusion
      • A truly embedded speech recognition system
        – A range of exciting applications
          • Real-time dictation with no perceived delay
          • Natural language interface (ASR + TTS)
          • Applications independent of the carrier
        – But… not available yet!
      • New speech recognition APIs are arriving soon
        – Rely on network/server availability
        – Can still lead to innovative applications
  41. Conclusion
      • Keys to success
        – Robustness, accuracy
        – Fast to load and execute
        – Well-designed interface
          • Speech cannot be used on its own
          • Should be cleverly combined with other interfaces
            – Graphical
            – Touch
            – …
        – Don’t put customers off with clumsy speech recognition widgets, again!
  42. Questions?
