Voice Recognition


  1. VOICE RECOGNITION – AMRITA MORE (416), AASHNA PARIKH (417)
  2. INTRODUCTION <ul><li>The user gives a predefined voice instruction to the system through a microphone; the system understands this command and executes the required function. </li></ul><ul><li>It enables the user to operate Windows by voice, without using the keyboard or mouse. </li></ul>
  3. KEY TERMS <ul><li>Speaking Modes </li></ul><ul><ul><li>Isolated Words </li></ul></ul><ul><ul><li>Continuous Speech </li></ul></ul><ul><li>Vocabulary Size </li></ul><ul><li>Language Model </li></ul><ul><li>Acoustic Model </li></ul><ul><li>Dictionary </li></ul>
  4. REALIZATION OF A MANDARIN SPEECH RECOGNITION SYSTEM USING SPHINX <ul><li>Mandarin: the main language of China, spoken by 855 million native speakers. </li></ul><ul><li>Mandarin Continuous Digit Recognition System </li></ul><ul><li>It is a small-vocabulary speech recognition system with only ten identity objects, the digits 0-9. </li></ul><ul><li>This technique builds the speech recognition system using Sphinx. </li></ul><ul><li>It also uses PocketSphinx, SphinxTrain, and CMUCLMTK. </li></ul>
  5. SPHINX <ul><li>Sphinx is a set of Java classes used in the background to recognize the voice. </li></ul><ul><li>It is open source and implemented in Java. </li></ul><ul><li>Sphinx is built on JSAPI (the Java Speech API). </li></ul><ul><li>It uses the HMM algorithm and a BNF grammar. </li></ul>
  6. OVERALL PROCESSING
  7. FEATURE EXTRACTION <ul><li>It generates a set of 51-dimension feature vectors which represent important characteristics of the speech signal. </li></ul><ul><li>It is used to convert the speech waveform to some type of parametric representation. </li></ul><ul><li>A wide range of possibilities exists for parametrically representing the speech signal, such as LPC (Linear Predictive Coding) and MFCC (Mel-Frequency Cepstral Coefficients). </li></ul> [Diagram: Speech Data → Feature Extraction → Acoustic Model + Language Model (trained on Text Data) → Recognition Engine → output]
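As an illustration of the framing step that precedes LPC or MFCC computation, here is a minimal Python sketch. It is not part of Sphinx: the frame and hop sizes are typical 25 ms / 10 ms values at 16 kHz, and a single log-energy number stands in for a real multi-dimensional feature vector.

```python
import math

def frame_signal(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (25 ms frames, 10 ms hop at 16 kHz)."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def log_energy(frame):
    """Log of the frame's total energy, a crude stand-in for a feature vector."""
    return math.log(sum(s * s for s in frame) + 1e-10)

# A fake 1-second, 16 kHz sine wave standing in for recorded speech.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
frames = frame_signal(signal)
features = [log_energy(f) for f in frames]
print(len(frames), "frames; first log-energy:", round(features[0], 3))
```

Each frame would normally be windowed and transformed (FFT, mel filterbank, cepstrum) to produce the feature vectors the slide describes.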
  8. Improved Acoustic Model Training <ul><li>SphinxTrain is the acoustic model training tool. </li></ul>
  9. Language Model Training [Pipeline: text → text2wfreq → wfreq; wfreq → wfreq2vocab → vocab; text → text2idngram → id-N-gram; id-N-gram + vocab → idngram2lm → arpa; arpa → lm3g2dmp → arpa.dmp; binlm2arpa converts arpa.dmp back to arpa]
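The CMUCLMTK pipeline above can be mimicked in miniature: count word frequencies (text2wfreq), build a vocabulary (wfreq2vocab), map the text to ids and count n-grams (text2idngram), then estimate probabilities (idngram2lm). This toy Python sketch uses unsmoothed maximum-likelihood bigram estimates on a made-up corpus, unlike the real tools, which apply smoothing and write ARPA-format files.

```python
from collections import Counter

corpus = "one two three one two one".split()

# text2wfreq: word frequency counts
wfreq = Counter(corpus)

# wfreq2vocab: a sorted vocabulary
vocab = sorted(wfreq)

# text2idngram: replace words by ids and count id bigrams
word_id = {w: i for i, w in enumerate(vocab)}
ids = [word_id[w] for w in corpus]
bigrams = Counter(zip(ids, ids[1:]))

# idngram2lm: maximum-likelihood bigram probabilities (no smoothing)
history = Counter(ids[:-1])
prob = {bg: count / history[bg[0]] for bg, count in bigrams.items()}

print(prob[(word_id["one"], word_id["two"])])  # P(two | one)
```

In this corpus "one" is always followed by "two", so P(two | one) is 1.0.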
  10. POCKETSPHINX <ul><li>Decoding Engine </li></ul><ul><li>It is used as a set of libraries that include the core speech recognition functions. </li></ul><ul><li>The input is an audio file in WAV format, and the final output of recognition is displayed as text. </li></ul>
  11. HIDDEN MARKOV MODEL (HMM) <ul><li>The real world has structures and processes which have (or produce) observable outputs: </li></ul><ul><ul><li>Usually sequential (the process unfolds over time) </li></ul></ul><ul><ul><li>We cannot see the event producing the output </li></ul></ul><ul><ul><li>Example: speech signals </li></ul></ul>
  12. HMM Background <ul><li>Basic theory developed and published in the 1960s and 70s </li></ul><ul><li>No widespread understanding and application until the late 80s </li></ul><ul><li>A few reasons: </li></ul><ul><ul><li>The theory was published in mathematics journals which were not widely read by practicing engineers </li></ul></ul><ul><ul><li>Insufficient tutorial material for readers to understand and apply the concepts </li></ul></ul>
  13. HMM Overview <ul><li>Machine learning method </li></ul><ul><li>Makes use of state machines </li></ul><ul><li>Based on a probabilistic model </li></ul><ul><li>Can only observe output from the states, not the states themselves </li></ul><ul><ul><li>Example: speech recognition </li></ul></ul><ul><ul><ul><li>Observed: acoustic signals </li></ul></ul></ul><ul><ul><ul><li>Hidden states: phonemes </li></ul></ul></ul><ul><ul><ul><ul><li>(the distinctive sounds of a language) </li></ul></ul></ul></ul>
  14. HMM Components <ul><li>A set of states (x’s) </li></ul><ul><li>A set of possible output symbols (y’s) </li></ul><ul><li>A state transition matrix (a’s): the probability of making a transition from one state to the next </li></ul><ul><li>An output emission matrix (b’s): the probability of emitting/observing a symbol at a particular state </li></ul><ul><li>An initial probability vector: </li></ul><ul><ul><li>the probability of starting at a particular state </li></ul></ul><ul><ul><li>Not shown; sometimes assumed to be 1 </li></ul></ul>
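A minimal Python sketch of these components, with made-up numeric values, plus the standard forward algorithm to compute the probability of an observation sequence under the model:

```python
# Toy HMM; the numeric values are made up for illustration.
states = ["s1", "s2"]            # hidden states (the x's)
symbols = ["a", "b"]             # output symbols (the y's)
A = [[0.7, 0.3], [0.4, 0.6]]     # state transition matrix (the a's)
B = [[0.9, 0.1], [0.2, 0.8]]     # output emission matrix (the b's)
pi = [0.5, 0.5]                  # initial probability vector

def forward(obs):
    """P(obs | model), computed with the forward algorithm."""
    sym = {s: i for i, s in enumerate(symbols)}
    # Initialize with start probability times first emission.
    alpha = [pi[i] * B[i][sym[obs[0]]] for i in range(len(states))]
    # Propagate through transitions and emissions for each later symbol.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(states))) * B[j][sym[o]]
                 for j in range(len(states))]
    return sum(alpha)

print(forward(["a", "b"]))
```

Summing the forward probability over all possible length-2 observation sequences gives 1, a quick sanity check that the matrices are consistent.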
  15. Observable Markov Model Example <ul><li>Weather </li></ul><ul><ul><li>Once each day the weather is observed </li></ul></ul><ul><ul><ul><li>State 1: rain </li></ul></ul></ul><ul><ul><ul><li>State 2: cloudy </li></ul></ul></ul><ul><ul><ul><li>State 3: sunny </li></ul></ul></ul><ul><ul><li>What is the probability the weather for the next 7 days will be: </li></ul></ul><ul><ul><ul><li>sun, sun, rain, rain, sun, cloudy, sun </li></ul></ul></ul><ul><ul><li>Each state corresponds to a physically observable event </li></ul></ul> Transition matrix (rows = today, columns = tomorrow): Rainy → Rainy 0.4, Cloudy 0.3, Sunny 0.3; Cloudy → Rainy 0.2, Cloudy 0.6, Sunny 0.2; Sunny → Rainy 0.1, Cloudy 0.1, Sunny 0.8
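Because every state is directly observable here, the probability of the 7-day sequence is just the product of the transition probabilities, taking the first day as given:

```python
# Transition probabilities copied from the table on the slide.
P = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

seq = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]

p = 1.0  # the first day is taken as observed, with probability 1
for today, tomorrow in zip(seq, seq[1:]):
    p *= P[today][tomorrow]

print(p)  # 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2
```

The product is 1.92 × 10⁻⁴: sequences of weather become unlikely quickly, which is why recognizers work with log probabilities in practice.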
  16. Common HMM Types <ul><li>Ergodic (fully connected): </li></ul><ul><ul><li>Every state of the model can be reached in a single step from every other state </li></ul></ul><ul><li>Bakis (left-right): </li></ul><ul><ul><li>As time increases, states proceed from left to right </li></ul></ul>
  17. HMM Advantages <ul><li>Advantages: </li></ul><ul><ul><li>Effective </li></ul></ul><ul><ul><li>Can handle variations in record structure </li></ul></ul><ul><ul><ul><li>Optional fields </li></ul></ul></ul><ul><ul><ul><li>Varying field ordering </li></ul></ul></ul>
  18. HMM Uses <ul><ul><li>Speech recognition: recognizing spoken words and phrases </li></ul></ul><ul><ul><li>Text processing: parsing raw records into structured records </li></ul></ul><ul><ul><li>Bioinformatics: protein sequence prediction </li></ul></ul><ul><ul><li>Financial: </li></ul></ul><ul><ul><ul><li>Stock market forecasts (price pattern prediction) </li></ul></ul></ul><ul><ul><ul><li>Comparison shopping services </li></ul></ul></ul>
  19. THE LEXICAL ACCESS COMPONENT OF THE CMU CONTINUOUS SPEECH RECOGNITION SYSTEM <ul><li>The CMU Lexical Access System hypothesizes words from a phonetic dictionary. </li></ul><ul><li>Word hypotheses are anchored on syllabic nuclei and are generated independently for different parts of the utterance. </li></ul> Example: the word "cat" [kæt] has the syllabic nucleus [æ].
  20. Word Hypothesizer System Diagram [Components: Front End, Coarse Labeler, Anchor Generator, Lexicon, Matcher, Lattice Integrator, Verifier, Parser]
  21. MATCHING ENGINE <ul><li>Words are hypothesized by matching an input sequence of labels against the stored representation of the possible pronunciations. </li></ul><ul><li>It uses the beam search algorithm, which is a modified best-first search strategy. </li></ul><ul><li>The beam search algorithm can simultaneously search paths of different lengths. </li></ul>
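A simplified beam search sketch in Python. The per-frame label probabilities below are hypothetical, and the real matcher scores label sequences against the pronunciation network and handles variable-length paths; this only shows the core idea of pruning to the best few partial hypotheses at each step.

```python
import heapq

def beam_search(steps, beam_width=2):
    """Expand every surviving path by one label per step,
    keeping only the best beam_width partial paths."""
    beam = [(1.0, [])]  # (score, path-so-far)
    for candidates in steps:
        expanded = [(score * p, path + [label])
                    for score, path in beam
                    for label, p in candidates.items()]
        beam = heapq.nlargest(beam_width, expanded, key=lambda x: x[0])
    return beam

# Hypothetical per-frame label probabilities for the word "cat".
steps = [{"k": 0.6, "g": 0.4}, {"ae": 0.7, "eh": 0.3}, {"t": 0.9, "d": 0.1}]
best_score, best_path = beam_search(steps)[0]
print(best_path, round(best_score, 3))
```

With a beam width of 2, only two of the eight possible three-label paths are ever fully scored, which is the computational saving beam search buys over exhaustive search.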
  22. THE LEXICON <ul><li>The lexicon (dictionary) is stored in the form of a phonetic network. </li></ul><ul><li>The sources of pronunciations that have been used: </li></ul><ul><ul><li>An on-line phonetic dictionary, such as the Shop Dictionary. </li></ul></ul><ul><ul><li>A letter-to-sound compiler (the Talk System). </li></ul></ul><ul><li>The current CMU lexicon is constructed using a base of over 150 rules covering several types of phenomena: </li></ul><ul><ul><li>Co-articulatory phenomena. </li></ul></ul><ul><ul><li>Front-end characteristics. </li></ul></ul>
  23. ANCHOR GENERATION <ul><li>To eliminate unnecessary matches, the voice recognition system uses syllable anchors to select locations in an utterance where words are to be hypothesized. </li></ul><ul><li>The anchor generation algorithm is straightforward and is based on the following reasoning: </li></ul><ul><ul><li>Words are composed of syllables, and every syllable contains a vocalic center. </li></ul></ul><ul><ul><li>Word divisions cannot occur inside a vocalic center. </li></ul></ul><ul><ul><li>The coarse labeler provides information about vocalic, non-vocalic, and silent regions. </li></ul></ul><ul><li>The algorithm is implemented in such a way that the “best” hypotheses are generated. </li></ul>
  24. ANCHORS HAVE BEEN USED IN THE SYSTEM IN 2 MODES: <ul><li>Single Anchor: </li></ul><ul><ul><li>In single anchor mode, anchors of different lengths are generated and the matcher is invoked separately for each one. Although this procedure is simple, it is also inefficient. </li></ul></ul><ul><li>Multiple Anchor: </li></ul><ul><ul><li>Multiple anchor mode reduces the computation and also reduces the number of hypotheses generated. </li></ul></ul>
  25. COARSE LABELER <ul><li>The coarse labeling algorithm is based on the ZAPDASH (Zero-crossings And Peak-to-peak Amplitudes of Differenced And Smoothed data) algorithm. </li></ul><ul><li>The algorithm is robust and speaker-independent, and operates reliably over a large dynamic range. </li></ul>
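The two measurements named in ZAPDASH, zero-crossings and peak-to-peak amplitude, are easy to illustrate in Python (the smoothing and differencing steps of the real algorithm are omitted, and the two test signals are made up):

```python
import math

def zero_crossings(frame):
    """Count sign changes; high for noisy/fricative-like regions."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def peak_to_peak(frame):
    """Amplitude range; high for vocalic regions, near zero for silence."""
    return max(frame) - min(frame)

fricative_like = [0.3 * (-1) ** t for t in range(40)]                   # rapid, small flips
vocalic_like = [5 * math.sin(2 * math.pi * t / 40) for t in range(40)]  # slow, large swing

print(zero_crossings(fricative_like), zero_crossings(vocalic_like))
print(peak_to_peak(fricative_like), peak_to_peak(vocalic_like))
```

Thresholding these two cheap measurements per frame is enough to separate vocalic, non-vocalic, and silent regions coarsely, which is all the anchor generator needs.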
  26. PHONETIC LATTICE INTEGRATOR <ul><li>The phonetic labels produced by the front end are grouped into four separate lattices: vowels, fricatives, closures, and stops. </li></ul><ul><li>The role of the integrator is to combine these separate streams and produce a single lattice consisting of non-overlapping segments. </li></ul><ul><li>The integrator maps the label space used by the front end into the label space used in the lexicon. </li></ul>
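A toy sketch of the integration step: merge per-class segment streams into one non-overlapping lattice, here resolving an overlap by simply keeping the higher-scoring segment. The segment values are made up, and the real integrator is more sophisticated (it can trim segments rather than discard them, and also relabels them for the lexicon).

```python
def integrate(streams):
    """Merge per-class segment streams into one non-overlapping lattice.
    Segments are (start, end, label, score); on overlap, keep the higher score."""
    segments = sorted(seg for stream in streams for seg in stream)
    merged = []
    for start, end, label, score in segments:
        if merged and start < merged[-1][1]:      # overlaps the previous segment
            if score > merged[-1][3]:
                merged[-1] = (start, end, label, score)
        else:
            merged.append((start, end, label, score))
    return merged

# Hypothetical segments from two of the four lattices.
vowels = [(10, 30, "AA", 0.9)]
fricatives = [(25, 40, "S", 0.5), (45, 60, "F", 0.8)]
print(integrate([vowels, fricatives]))
```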
  27. JUNCTION VERIFIER <ul><li>The verifier examines the junctures between words and determines whether these words can be connected together in sequence. </li></ul><ul><li>The verifier deals with three classes of junctures: </li></ul><ul><ul><li>Abutments </li></ul></ul><ul><ul><li>Gaps </li></ul></ul><ul><ul><li>Overlaps </li></ul></ul>
  28. CONCLUSION <ul><li>This is not nearly detailed enough to actually write a speech recognizer, but it exposes the basic concepts. </li></ul><ul><li>The basic concepts we covered for implementing speech recognition are: </li></ul><ul><ul><li>Sphinx </li></ul></ul><ul><ul><li>The Lexical Access System </li></ul></ul><ul><ul><li>The HMM model </li></ul></ul><ul><li>Some real-life implementations of these techniques are still in development, while others have been successfully launched. </li></ul><ul><li>Example: Winvoice, using Sphinx. </li></ul>
  29. REFERENCES <ul><li>Alexander I. Rudnicky, Lynn K. Baumeister, Kevin H. DeGraaf, “The Lexical Access Component of the CMU Continuous Speech Recognition System”, pp. 376-379, 1987, IEEE. </li></ul><ul><li>Yun Wang and Xueying Zhang, “Realization of Mandarin Continuous Digits Speech Recognition”, pp. 378-380, 2010, IEEE. </li></ul><ul><li>Todd A. Stephenson, “Speech Recognition with Auxiliary Information”, pp. 189-203, 2004, IEEE. </li></ul>
  30. ANY QUESTIONS....?
  31. THANK YOU....!!
