
Speech Recognition Technology

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words.

  2. Introduction
     • Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words.
     • The recognized words can be an end in themselves, as in applications such as command and control, data entry, and document preparation.
     • They can also serve as the input to further linguistic processing in order to achieve speech understanding.
     • It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT).
  3. History
     • ASR has been around since the 1960s and has seen steady, incremental improvement over the years.
     • It has benefited greatly from the increased processing speed of computers in the last decade, entering the marketplace in the mid-2000s.
     • Early systems were based on acoustic phonetics and worked with small vocabularies to identify isolated words.
     • Over the years, vocabularies have grown and ASR systems have become statistics-based.
     • They now have large vocabularies and can recognize continuous speech.
  4. Basic Structure
  5. Digital Sampling
     • When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand.
     • To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals.
     • The system filters the digitized sound to remove unwanted noise, and sometimes to separate it into different bands of frequency.
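
The sampling step can be sketched in a few lines of Python. This is only an illustration: the 440 Hz test tone, the 16 kHz sample rate, the 16-bit depth, and the function name `sample_and_quantize` are assumptions chosen for the example, not values from the slides.

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, duration_s, bits=16):
    """Measure a sine wave at regular intervals and round each reading to an integer level."""
    max_level = 2 ** (bits - 1) - 1                 # largest signed value, 32767 for 16 bits
    n_samples = round(sample_rate_hz * duration_s)  # how many measurements are taken
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                      # time of the n-th measurement, in seconds
        amplitude = math.sin(2 * math.pi * freq_hz * t)  # "analog" value in [-1, 1]
        samples.append(round(amplitude * max_level))     # quantize to an integer level
    return samples

# Hypothetical input: 10 ms of a 440 Hz tone sampled 16,000 times per second.
pcm = sample_and_quantize(freq_hz=440, sample_rate_hz=16000, duration_s=0.01)
print(len(pcm), pcm[:4])
```

A real ADC does this in hardware; the sketch only shows the two ideas named on the slide, measuring at frequent intervals and storing each measurement as a number.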
  6. Acoustic model
     • Next the signal is divided into small segments as short as a few hundredths of a second, or even thousandths in the case of plosive consonant sounds (consonant stops produced by obstructing airflow in the vocal tract) like "p" or "t."
     • The program then matches these segments to known phonemes in the appropriate language.
     • A phoneme is the smallest unit of sound that distinguishes one word from another in a language; phonemes are the sounds we make and put together to form meaningful expressions.
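
The segmentation step might look like the following sketch. The 25 ms frame length and 10 ms step are assumed values for illustration (the slide only says "a few hundredths of a second"), and `split_into_frames` is a hypothetical helper name.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=25, hop_ms=10):
    """Cut a list of samples into short, overlapping segments (frames)."""
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    hop_len = sample_rate_hz * hop_ms // 1000       # samples between frame starts
    frames = []
    # Stop once a full frame no longer fits in the remaining signal.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# Placeholder signal: one second of dummy samples at 16 kHz.
signal = list(range(16000))
frames = split_into_frames(signal, sample_rate_hz=16000)
print(len(frames), len(frames[0]))
```

Each frame would then be compared against known phonemes; that matching step is not shown here.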
  7. Language model
     • The program examines phonemes in the context of the other phonemes around them.
     • It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences.
     • The program then determines what the user was probably saying and either outputs it as text or issues a computer command.
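
As a rough illustration of how context can settle what the user was "probably saying", the sketch below scores two candidate word sequences with a bigram model. The tiny corpus, the add-one smoothing, and the function names are assumptions made for the example; a real system would use a far larger library and a more elaborate model.

```python
from collections import Counter

# Toy "library" of known text, invented for this example.
corpus = ("they left their bags over there . "
          "their car is there . put their coats there").split()

unigrams = Counter(corpus)                      # how often each word occurs
bigrams = Counter(zip(corpus, corpus[1:]))      # how often each word pair occurs
vocab_size = len(unigrams)

def bigram_prob(prev, word):
    """Estimate P(word | prev) with add-one smoothing so unseen pairs are not zero."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_score(words):
    """Multiply the bigram probabilities along the sentence."""
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

# Two candidates that sound alike but use "their" and "there" differently.
candidates = [
    ["put", "their", "coats", "there"],
    ["put", "there", "coats", "their"],
]
best = max(candidates, key=sentence_score)
print(" ".join(best))
```

The first candidate wins because all of its word pairs appear in the toy corpus, while none of the second candidate's pairs do.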
  8. Statistical Modeling Systems
     • These systems use probability and mathematical functions to determine the most likely outcome.
     • The two models that dominate the field today are hidden Markov models and neural networks.
     • These methods involve complex mathematical functions, but essentially, they take the information known to the system to figure out the information hidden from it.
  9. Hidden Markov Model (HMM)
     • In this model, each phoneme is like a link in a chain, and the completed chain is a word.
     • The chain branches off in different directions as the program attempts to match the digital sound with the phoneme that's most likely to come next.
     • During this process, the program assigns a probability score to each phoneme, based on its built-in dictionary and user training.
  10. Markov Model
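
One common way to find the most likely chain of phonemes in an HMM is the Viterbi algorithm. The slides do not name a decoding method, so treat the sketch below as an assumption; the three-phoneme word ("k", "ae", "t"), the coarse "burst"/"vowel" observations, and every probability are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that gives the highest score for reaching s.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

# Hypothetical left-to-right chain for the word "cat".
states = ["k", "ae", "t"]
start_p = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans_p = {
    "k":  {"k": 0.3, "ae": 0.7, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.5, "t": 0.5},
    "t":  {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit_p = {
    "k":  {"burst": 0.8, "vowel": 0.2},
    "ae": {"burst": 0.1, "vowel": 0.9},
    "t":  {"burst": 0.8, "vowel": 0.2},
}
path = viterbi(["burst", "vowel", "vowel", "burst"], states, start_p, trans_p, emit_p)
print(path)
```

Here the acoustic segments are the known information and the phoneme sequence is the hidden information the model recovers.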
  11. Neural Networks
     A class of statistical models may be called "neural" if it
     • consists of sets of adaptive weights, i.e. numerical parameters that are tuned by a learning algorithm, and
     • is capable of approximating non-linear functions of its inputs.
     The adaptive weights are conceptually connection strengths between neurons, which are activated during training and prediction.
  12. Each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.
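
To make the nodes and connections concrete, here is a minimal sketch of a forward pass through a network with two inputs, two hidden neurons and one output neuron. The weights are set by hand for the example so that the network computes XOR, a simple non-linear function; in a real system a learning algorithm would tune them. All names and values are assumptions.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of its inputs, then a non-linear activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def tiny_network(x1, x2):
    # Hand-picked connection strengths (not learned): the hidden neurons act
    # roughly like OR and NAND, and the output neuron acts like AND.
    h1 = neuron([x1, x2], weights=[20.0, 20.0], bias=-10.0)
    h2 = neuron([x1, x2], weights=[-20.0, -20.0], bias=30.0)
    return neuron([h1, h2], weights=[20.0, 20.0], bias=-30.0)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(tiny_network(a, b)))
```

No single neuron of this kind can compute XOR on its own, which is why the hidden layer is needed.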
  13. Program Training
     • Recognition is more complicated for phrases and sentences, because the system has to figure out where each word starts and stops.
     • The statistical systems need large amounts of example training data to reach their best performance, sometimes on the order of thousands of hours of human-transcribed speech and hundreds of megabytes of text.
     • The training data are used to create acoustic models of words, word lists and multi-word probability networks.
     • The details can make the difference between a well-performing system and a poorly performing one, even when both use the same basic algorithm.
  14. Applications
     • Transcription: dictation, information retrieval
     • Command and control: data entry, device control, navigation, call routing
     • Information access: airline schedules, stock quotes, directory assistance
     • Problem solving: travel planning, logistics
  15. Weaknesses and Flaws
     • Low signal-to-noise ratio: the program needs to "hear" the words spoken distinctly, and any extra noise introduced into the sound will interfere with this.
     • Overlapping speech: current systems have difficulty separating simultaneous speech from multiple users.
     • Intensive use of computer power.
     • Homophones, e.g. "there" and "their," "air" and "heir," "be" and "bee."
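
Signal-to-noise ratio is usually expressed in decibels as ten times the base-10 logarithm of signal power over noise power. The sketch below computes it for two short lists of samples; the sample values and the helper names are made up for the example.

```python
import math

def mean_power(samples):
    """Average of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))

# Hypothetical samples: the noise amplitude is one tenth of the speech amplitude.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(round(snr_db(speech, noise), 1))
```

With these values the noise power is one hundredth of the signal power, so the ratio comes out at about 20 dB; a lower figure means the words are harder to "hear".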
  16. Major Challenges
     • Making a system that can flawlessly handle roadblocks like slang, dialects, accents and background noise.
     • The different grammatical structures used by languages can also pose a problem. For example, Arabic sometimes uses single words to convey ideas that are entire sentences in English.
  17. The Future of Speech Recognition
     • The Defense Advanced Research Projects Agency (DARPA) has three teams of researchers working on Global Autonomous Language Exploitation (GALE), a program that will take in streams of information from foreign news broadcasts and newspapers and translate them.
     • It hopes to create software that can instantly translate between two languages with at least 90 percent accuracy.
     • DARPA is also funding an R&D effort called TRANSTAC to enable soldiers to communicate more effectively with civilian populations in non-English-speaking countries.
  18. Conclusion
     At some point in the future, speech recognition may become speech understanding. The statistical models that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words. Although it is a huge leap in terms of computational power and software sophistication, some researchers argue that speech recognition development offers the most direct line from the computers of today to true artificial intelligence.
