Dev Days, Speech Recognition, LM Aubert

    Dev Days, Speech Recognition, LM Aubert - Presentation Transcript

    1. Speech Recognition on embedded devices. Louis-Marie Aubert, ECIT – Queen’s University Belfast. DevDays – Belfast – April 24, 2009
    2. What should we expect from speech recognition?
    3. Speech Recognition success? • Natural continuous speech • Real-time • Large vocabulary (up to 100,000 words) • No training (speaker independent) • Adaptive to speaker accent • Robust against – Background noise – Audio frontend imperfections • N-best hypotheses with confidence value
    4. What are the solutions on the market?
    5. Existing solutions • Server-based – Telephony, IVR – Dictation (health care industry) – Audio indexing • Either offline or with significant delays
    6. Existing solutions • Desktop-based – Real-time dictation – Language learning • Requires a good setup, a powerful computer and a quiet environment • Very good accuracy, no training required
    7. Existing solutions • Embedded applications – Simple voice commands (‘Call-mum’ type command) – Disconnected word recognition • Small vocabulary and lack of naturalness restrict the range of applications
    8. Is it so difficult?
    9. Technical challenge [Diagram: speech waveform → Speech Recognizer → transcription ‘Hello world’]
    10. Technical challenge [Diagram: speech waveform → Spectral Analyser → acoustic feature vectors, ~40 coefficients every 10 ms]
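
The framing step behind slide 10 can be sketched as follows. This is only an illustration: the slide gives the 10 ms step and the ~40 coefficients per vector, while the 16 kHz sample rate, the 25 ms window and the name `frame_signal` are assumptions, and the spectral analysis itself is left out.

```python
def frame_signal(samples, sample_rate=16000, hop_ms=10, window_ms=25):
    """Cut a waveform into overlapping frames, one new frame every hop_ms.

    sample_rate and window_ms are assumed values, not taken from the slides.
    """
    hop = sample_rate * hop_ms // 1000        # samples between frame starts
    window = sample_rate * window_ms // 1000  # samples per frame
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames


one_second = [0.0] * 16000            # one second of silence as a placeholder
frames = frame_signal(one_second)

# One feature vector every 10 ms means 100 vectors per second of speech,
# so ~40 coefficients per vector is about 4000 coefficients per second.
coefficients_per_second = (1000 // 10) * 40
print(len(frames), len(frames[0]), coefficients_per_second)
```

Each frame would then go through the spectral analyser to produce one feature vector.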
    11. Technical challenge [Diagram: acoustic feature vectors → Recognizer (senone calculation, Viterbi decoding) → transcription ‘Hello world’. Knowledge sources: Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model]
    13. Technical challenge: Acoustic Models • 4000 acoustic models • Sub-acoustic unit • Functions that score 10 ms of speech • Sets of 40-long mean and variance vectors of Gaussian mixtures (16)
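
A minimal sketch of how one acoustic model could score a single 10 ms feature vector, using the figures on slide 13 (4000 models, 16 Gaussians each, 40-long vectors). Diagonal covariances, the function names and the toy parameter values are assumptions made for illustration. The closing arithmetic assumes every model is evaluated for every frame, which a real decoder may avoid.

```python
import math

NUM_MODELS = 4000    # acoustic models (slide 13)
NUM_MIXTURES = 16    # Gaussians per model (slide 13)
DIM = 40             # coefficients per feature vector (slide 13)


def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def mixture_log_score(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture, via log-sum-exp."""
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


# Toy model: 16 identical unit-variance components centred on zero.
x = [0.0] * DIM
weights = [1.0 / NUM_MIXTURES] * NUM_MIXTURES
means = [[0.0] * DIM for _ in range(NUM_MIXTURES)]
variances = [[1.0] * DIM for _ in range(NUM_MIXTURES)]
score = mixture_log_score(x, weights, means, variances)

# Size and cost implied by the slide's figures.
parameters = NUM_MODELS * NUM_MIXTURES * DIM * 2                  # means + variances
gaussian_terms_per_second = NUM_MODELS * NUM_MIXTURES * DIM * 100  # 100 frames/s
print(score, parameters, gaussian_terms_per_second)
```

Under these assumptions the models hold about 5 million parameters, and scoring all of them costs about 256 million per-dimension terms per second of speech.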
    16. Technical challenge: Phoneme • 50 in English • Differentiable sounds • Represent a sequence of senones: HMM (Hidden Markov Model) – ‘ah’: ah1 ah2 ah3 – ‘l’: l1 l2 l3
    17. Technical challenge: Triphone • 2500 in English • Differentiable sounds in their context (continuous speech) – ‘hh-ah+l’: ah1 ah2 ah3 – ‘ah-l+ow’: l1 l2 l3
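
The left-centre+right notation on slide 17 can be produced mechanically from a phoneme sequence. The sketch below is illustrative: the function name `to_triphones` and the `sil` boundary symbol used at the edges of the word are assumptions.

```python
def to_triphones(phonemes, boundary="sil"):
    """Rewrite each phoneme as 'left-centre+right' using its two neighbours.

    The boundary symbol padding both ends is an assumed convention.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


hello = to_triphones(["hh", "ah", "l", "ow"])
print(hello)
```

For ‘hello’ this yields the two triphones shown on the slide, ‘hh-ah+l’ and ‘ah-l+ow’, plus one boundary triphone at each end.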
    18. Technical challenge [Diagram: acoustic feature vectors → Recognizer (senone calculation, Viterbi decoding) → transcription ‘Hello world’. Knowledge sources: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model]
    20. Technical challenge: Word • Large vocabulary: 64000 • Represent a sequence of phonemes/triphones – ‘hello’: hh ah l ow – ‘world’: w er l d
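
A small sketch of the word lexicon on slide 20, expanding each word into the state sequence the decoder would walk through. The two entries come from the slide. The three states per phoneme follow the ‘ah1 ah2 ah3’ pattern shown earlier in the deck. The state naming and the function names are hypothetical.

```python
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],   # from slide 20
    "world": ["w", "er", "l", "d"],     # from slide 20
}
STATES_PER_PHONEME = 3  # matches the 'ah1 ah2 ah3' pattern in the slides


def word_to_states(word, lexicon=LEXICON):
    """Expand a word into phonemes, then each phoneme into numbered states."""
    return [f"{phoneme}{i}"
            for phoneme in lexicon[word]
            for i in range(1, STATES_PER_PHONEME + 1)]


def sentence_to_states(words, lexicon=LEXICON):
    """Concatenate the state sequences of consecutive words."""
    states = []
    for word in words:
        states.extend(word_to_states(word, lexicon))
    return states


hello_states = word_to_states("hello")
sentence_states = sentence_to_states(["hello", "world"])
print(hello_states)
print(len(sentence_states))
```

With a 64000-word lexicon the same expansion is applied to every entry, which is one reason the decoding network becomes so large.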
    23. Technical challenge: Statistical language model • Bi-gram / Tri-gram • Give the probability of sequence of 2/3 words • 64000 words leads to roughly 10 million states / 50 million arcs – Example: ‘hello’ followed by ‘mum’ 0.3, ‘dad’ 0.2, ‘world’ 0.05
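
The bi-gram example on slide 23 can be written as a lookup table. The three probabilities are the ones on the slide. The function names and the `floor` value for unseen word pairs are assumptions, a crude stand-in for the smoothing a real language model would use.

```python
import math

# Probabilities of the word following 'hello', as given on slide 23.
BIGRAMS = {"hello": {"mum": 0.3, "dad": 0.2, "world": 0.05}}


def most_likely_next(word, bigrams=BIGRAMS):
    """Return the most probable follower of word, or None if unknown."""
    followers = bigrams.get(word, {})
    return max(followers, key=followers.get) if followers else None


def sequence_log_prob(words, bigrams=BIGRAMS, floor=1e-6):
    """Sum of log bi-gram probabilities; unseen pairs get an assumed floor."""
    total = 0.0
    for prev, nxt in zip(words, words[1:]):
        total += math.log(bigrams.get(prev, {}).get(nxt, floor))
    return total


best = most_likely_next("hello")
hello_world = sequence_log_prob(["hello", "world"])

# With 64000 words there are 64000 * 64000 possible word pairs.
possible_pairs = 64000 ** 2
print(best, hello_world, possible_pairs)
```

The slide's figure of roughly 50 million arcs is far below the roughly 4.1 billion possible pairs, so only a fraction of the pairs is stored explicitly.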
    25. Technical challenge [Diagram: acoustic feature vectors → Recognizer (senone calculation, Viterbi decoding) → transcription ‘Hello world’. Knowledge sources: Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model. Combined network: ~ 25 million states / 250 million arcs]
    27. Technical challenge: Viterbi decoding • Token passing algorithm • 5000/10000 tokens to propagate every 10 ms • Select the most promising tokens and output associated sequence of: senones, triphones, words, sentence [Diagram: tokens moving through a network of senone states (l1 l2 l3, ow1 ow2 ow3, s1 s2 s3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3); ~ 25 million states / 250 million arcs]
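
Token passing as described on slide 27 can be sketched on a toy network. This is a simplified illustration, not the presenter's implementation: the two-state network, the scores and the function name `token_passing` are made up, and only `max_tokens` echoes the 5000/10000 tokens quoted on the slide.

```python
import math


def token_passing(frame_scores, arcs, start_state="start", max_tokens=5000):
    """Viterbi search by token passing with a simple pruning step.

    frame_scores: one dict per 10 ms frame, state -> log acoustic score.
    arcs: state -> list of (next_state, log transition probability).
    Returns (best log score, best state path).
    """
    tokens = {start_state: (0.0, [])}          # state -> (score, history)
    for scores in frame_scores:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, log_trans in arcs.get(state, []):
                if nxt not in scores:
                    continue
                candidate = score + log_trans + scores[nxt]
                # Keep only the best token arriving in each state.
                if nxt not in new_tokens or candidate > new_tokens[nxt][0]:
                    new_tokens[nxt] = (candidate, path + [nxt])
        # Pruning: keep the max_tokens most promising tokens.
        ranked = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:max_tokens])
    if not tokens:
        raise ValueError("no token survived")
    return max(tokens.values(), key=lambda token: token[0])


half = math.log(0.5)
arcs = {
    "start": [("a1", 0.0)],
    "a1": [("a1", half), ("a2", half)],
    "a2": [("a2", half)],
}
frame_scores = [
    {"a1": -1.0, "a2": -5.0},
    {"a1": -4.0, "a2": -1.0},
    {"a1": -5.0, "a2": -1.0},
]
best_score, best_path = token_passing(frame_scores, arcs)
print(best_score, best_path)
```

In a full recogniser the token history would also record word boundaries, so that the best surviving token yields the transcription.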
    29. Challenges in embedded systems • Low computational resources • Power consumption constraints • Noisy environment, poor audio quality. For a truly embedded speech recognition engine that works, we must move away from the pure software approach: • Make the best of all hardware acceleration available • Dedicated chip (accelerator) to offload the CPU and relax memory constraints
    30. Why do we want speech recognition on embedded devices anyway?
    31. Applications on mobiles • Complement touch screen interface with speech interface • Speech enable existing mobile applications – Browse complex menus – Easily find items in large libraries, local or online (contacts, music…) – Browse Web and search maps – Games – Compose text-messages, emails…
    32. Applications on mobiles • Speech enable mobile applications (Source: Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008)
    33. Applications on mobiles • Key to safety when driving – Text-messaging – Satellite-Navigation function • Voice Memo – Shopping list – Activity scheduler • Market of Speech technology in embedded devices – $125 million in 2006 – $500 million in 2010 (Source: Opus Research report, March 2007)
    34. Other markets • Developing countries – Access to information technology for illiterate people • Administrative tasks • Education • Social integration • Health-care at home (self-manage diseases) – Exploding market • Chronic diseases • Elderly people (Baby Boomers reach retirement age) • Market for home health care products is evaluated at $4.3 billion today – Place for Speech recognition • Inexperience of patients with electronic interfaces • Poor physical condition (e.g. low vision) • Illiteracy Medical device today, March 2009
    35. Other applications • Speech translation – IraqCom
    36. Okay, I can’t wait! Is there anything I can use now?
    37. Upcoming solutions • Voicemail accessible via text-message, email or dedicated application – Server-based – Require agreement and implementation by the carriers
    38. Upcoming solutions • Nuance Voice Control 2 – Online search – Text-messaging • Embedded software for simple voice command • Server-based engine for large vocabulary speech recognition • Speech Recognition API on Android 1.5
    39. So?
    40. Conclusion • A truly embedded speech recognition system – A range of exciting applications • Real-time dictation with no perceived delay • Natural language interface (ASR + TTS) • Applications independent of the carrier – But… not available yet! • New speech recognition APIs are arriving soon – Rely on network/server availability – Can still lead to innovative applications
    41. Conclusion • Keys to success – Robustness, accuracy – Fast to load and execute – Well designed interface • Speech cannot be used on its own • Should be cleverly combined with other interfaces – Graphical – Touch – … – Don’t put customers off with clumsy speech recognition widgets, again!
    42. Questions?


    Overview of Automatic Speech Recognition (ASR) for more
