Speech recognition an overview


Published on

Speech Recognition - An Overview

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Speech recognition an overview

  1. 1. Speech Recognition An Overview Submitted To: Submitted By: Name: Dr. M Syamala Devi Name: Varun Jain Designation: Professor Roll No.: 34 Department: DCSA, Panjab University, Chandigarh Class: MCA 5th Semester (M) Panjab University, Chandigarh
  2. 2. Table of Contents  Introduction  Introduction – A closer look  Terms and Concepts in Speech Recognition   Pronunciation   Utterances Grammar How it works  Acoustic Model  Grammar  Recognized Text  Performance Measurement  Standards  Google Search by Voice (GSbV) – A case study  GSbV – A screenshot  GSbV – An example  Conclusions  References
  3. 3. Introduction  Have you ever talked to your computer?  I mean, have you really, really talked to your computer?  Where it actually recognized what you said and then did something as a result?  If you have, then you've used a technology known as speech recognition.  It is the process of converting spoken input to text.  Speech recognition allows you to provide input to an application with your voice.  Speech recognition is thus sometimes referred to as speech-to-text.  In the desktop world, you need a microphone to be able to do this.  The speech recognition process is performed by a software component known as the speech recognition engine.  The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands.
  4. 4. Introduction – A closer look  Upon hearing a voice as input, a speech recognition application can do one of the following thing:    The application can interpret the result of the recognition as a command. In this case, the application is a command and control application. An example of a command and control application is one in which the caller says “check balance”, and the application returns the current balance of the caller’s account. If an application handles the recognized text simply as text, then it is considered a dictation application. In a dictation application, if you said “check balance,” the application would not interpret the result, but simply return the text “check balance”. For example, in response to hearing "Please say coffee, tea, or milk," you say "coffee" and the speech recognition application you're calling tells you what the flavor of the day is and then asks if you'd like to place an order.
  5. 5. Terms and Concepts in Speech Recognition  There are basically three terms that are majorly used in context of speech recognition concept.  These three are:  Utterance  Pronunciation  Grammar
  6. 6. Utterance  When the user says something, this is known as an utterance.  An utterance is any stream of speech between two periods of silence.  Utterances are sent to the speech engine to be processed.  Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance.  The speech recognition engine is "listening" for speech input.  When the engine detects audio input - in other words, a lack of silence -- the beginning of an utterance is signaled.  Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs.  If the user doesn’t say anything, the engine returns what is known as a silence timeout - an indication that there was no speech detected within the expected timeframe, and the application takes an appropriate action, such as re-prompting the user for input.
  7. 7. Pronunciation  The speech recognition engine uses all sorts of data, statistical models, and algorithms to convert spoken input into text.  One piece of information that the speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks a word should sound like.  Words can have multiple pronunciations associated with them.  For example, the word “the” has at least two pronunciations in the U.S. English language: “thee” and “thuh.”
  8. 8. Grammar  One must specify the words and phrases that users can say to your application.  These words and phrases are defined to the speech recognition engine and are used in the recognition process.  You can specify the valid words and phrases in a number of different ways, but in application, you do this by specifying a grammar.  A grammar uses a particular syntax, or set of rules, to define the words and phrases that can be recognized by the engine.  A grammar can be as simple as a list of words, or it can be flexible enough to allow such variability in what can be said that it approaches natural language capability.
  9. 9. How it works Fig. 1 A basic working model of speech recognition depicting how a general speech recognition works.
  10. 10. Acoustic Model  Acoustic models provide an estimate for the likelihood of the observed features in a frame of speech given a particular phonetic context.  The features are typically related to measurements of the spectral characteristics of a time-slice of speech.  While individual recipes for training acoustic models vary in their structure and sequencing, the basic process involves aligning transcribed speech to states within an existing acoustic model, accumulating frames associated with each state, and re-estimating the probability distributions associated with the state, given the features observed in those frames.
  11. 11. Grammar  A grammar can be as simple as a list of words, or it can be flexible enough to allow such variability in what can be said that it approaches natural language capability.  Grammars define the domain, or context, within which the recognition engine works.  The engine compares the current utterance against the words and phrases in the active grammars.  If the user says something that is not in the grammar, the speech engine will not be able to decipher it correctly.  You can define a single grammar for your application, or you may have multiple grammars.  Chances are, you will have multiple grammars, and you will activate each grammar only when it is needed.
  12. 12. Recognized Text  It is an output of this model that produces a likelihood if what user has provided as input commonly known as recognized text.  This text can be processed in two way by the application using this model:  As a command  As a dictation text  As a command, a further processing is made upon the text which serves as output  As a dictation text, a stream of text is shown as output as a formatted text just like a text can be shown on screen.
  13. 13. Performance Measurement  The performance of a speech recognition system is measurable.  Perhaps the most widely used measurement is accuracy.  Arguably the most important measurement of accuracy is whether the desired end result occurred.  This measurement is useful in validating application design.  For example, if the user said "yes," the engine returned "yes," and the "YES" action was executed, it is clear that the desired end result was achieved.  But what happens if the engine returns text that does not exactly match the utterance?  For example, what if the user said "nope," the engine returned "no," yet the "NO" action was executed?
  14. 14. Standard  There are a lot of standards by World Wide Web Consortium (W3C)  Some of them are enlisted below:  VoiceXML  SRGS (Speech Recognition Grammar Specification)  SISR (Semantic Interpretation for Speech Recognition)  SSML (Speech Synthesis Markup Language)  PLS (Pronunciation Lexicon Specification)  CCXML (Call Control eXtensible Markup Language)  MediaCTRL  MSML (Media Server Markup Language)  MSCML (Media Server Control Markup Language)
  15. 15. Google search by voice: A case study  Mobile web search is a rapidly growing area of interest.  Internet-enabled Smart phones account for an increasing share of the mobile devices sold throughout the world, and most models offer a web browsing experience that rivals desktop computers in display quality.  Users are increasingly turning to their mobile devices when doing web searches, driving efforts to enhance the usability of web search on these devices.  Although mobile device usability has improved, typing search queries can still be cumbersome, error-prone, and even dangerous in some usage scenarios.
  16. 16. GSbV – A screenshot
  17. 17. GSbV – An example Figure 3. An example of a speech recognizer app GSbV
  18. 18. Conclusions  Speech recognition will revolutionize the way people conduct business over the Web and will, ultimately, differentiate world-class e-businesses.  VoiceXML ties speech recognition and telephony together and provides the technology with which businesses can develop and deploy voice-enabled Web solutions TODAY!  These solutions can greatly expand the accessibility of Web-based self-service transactions to customers who would otherwise not have access, and, at the same time, leverage a business’ existing Web investments.  Speech recognition and VoiceXML clearly represent the next wave of the Web.
  19. 19. References  http://cmusphinx.sourceforge.net/wiki/tutorialconcepts  ftp://public.dhe.ibm.com/software/pervasive/info/products/Introduction_to _Speech_Reco gnition.pdf  https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved =0CDAQFjAB&url=http%3A%2F%2Fresearch.google.com%2Fpubs%2Farchive%2F3 6340.pdf&ei=13GVUoW5CY6FrAeZ04GgDg&usg=AFQjCNEWeZqQk1pIkXjtzrNG__ ZR-UUNbA&bvm=bv.57155469,d.bmk&cad=rja  http://en.wikipedia.org/wiki/VoiceXML