Automatic Speech Recognition
By: Khalid El-Darymli
G0327887
OUTLINE
- Automatic Speech Recognition (ASR): definition, capabilities, usage, and milestones.
- Structure of an ASR system: speech database, MFCC extraction, training and recognition.
- Conclusions.
Multilayer Structure of Speech Production
[Figure: the utterance "I would like to book a flight from Rome to London tomorrow morning" decomposed across the layers of speech production, from the task frame [book_airplane_flight] [from_locality] [to_locality] [departure_time], through the word sequence [I] [would] [like] [to] [book] [a] [flight] [from] [Rome] [to] [London] [tomorrow] [morning], down to the phonetic level, e.g. [book] -> [b/uh/k].]
- Pragmatic layer
- Semantic layer
- Syntactic layer
- Prosodic/Phonetic layer
- Acoustic layer
What is Speech Recognition?
- The process of converting an acoustic signal, captured by a microphone or telephone, into a sequence of words.
- The recognized words can be the final result, as in applications such as command and control, data entry, and document preparation.
- They can also serve as input to further linguistic processing in order to achieve speech understanding.
Capabilities of ASR include:
- Isolated-word recognizers: for segments separated by pauses.
- Word spotting: algorithms that detect occurrences of key words in continuous speech.
- Connected-word recognizers: identify uninterrupted, but strictly formatted, sequences of words (e.g., recognition of telephone numbers).
- Restricted speech understanding: systems that handle sentences relevant to a specific task.
- Task-independent continuous speech recognizers: the ultimate goal of the field.
Two types of systems:
- Speaker-dependent: the user must provide samples of his or her speech before using the system.
- Speaker-independent: no speaker enrollment is necessary.
Uses and Applications
- Dictation: includes medical transcription, legal and business dictation, and general word processing.
- Command and Control: ASR systems designed to perform functions and actions on the system.
- Telephony: some voice-mail systems allow callers to speak commands instead of pressing buttons to send specific tones.
- Wearables: because inputs are limited on wearable devices, speaking is a natural possibility.
- Medical/Disabilities: many people have difficulty typing due to physical limitations such as repetitive strain injury (RSI) or muscular dystrophy. Conversely, people with hearing difficulties could use a system connected to their telephone to convert a caller's speech to text.
- Embedded applications: some newer cellular phones include command-and-control speech recognition that allows utterances such as "Call home".
A Timeline & History of Voice Recognition Software
- 1995: Dragon released discrete-word, dictation-level speech recognition software. It was the first time dictation speech and voice recognition technology was available to consumers.
- 1984: SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.
- 1982: Dragon Systems was founded.
- 1971: DARPA established the Speech Understanding Research (SUR) program, with $3 million per year of government funds for 5 years. It was the largest speech recognition project ever.
- Early 1970s: the HMM approach to speech and voice recognition was invented by Lenny Baum of Princeton University.
- 1936: AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder.
…timeline, continued
- 2003: ScanSoft, Inc., presently the world leader in commercial speech recognition technology, shipped Dragon NaturallySpeaking 7 Medical, lowering healthcare costs through highly accurate speech recognition.
- 2000: Lernout & Hauspie acquired Dragon Systems for approximately $460 million.
- 1998: Microsoft invested $45 million to allow Microsoft to use speech and voice recognition technology in its systems.
- 1997: Dragon introduced "NaturallySpeaking", the first "continuous speech" dictation software available.
The Structure of an ASR System
[Figure: functional scheme of an ASR system. Speech samples S pass through a signal interface and feature extraction to produce the feature sequences X and Y; a training block estimates HMMs from speech databases; the recognition block uses the trained HMMs to map the features to the output word sequence W*.]
Speech Database
- A speech database is a collection of recorded speech accessible on a computer, supported with the necessary annotations and transcriptions.
- The databases supply the observations required for parameter estimation.
- The corpora have to be large enough to cover the variability of speech.
Transcription of Speech
- Transcription is linguistic information associated with digital recordings of acoustic signals.
- This symbolic representation of speech is used to easily retrieve the content of the databases.
- Transcription involves segmentation and labeling.
- Example: the figure shows an acoustic waveform for the sentence "how much allowance"; the speech data are segmented and labeled with the phoneme string "/h# hh ak m ah tcl cj ax l aw ax n s/".
[Figure: segmentation and labeling example]
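As a concrete illustration, here is a minimal sketch of reading such a segmentation, assuming a TIMIT-style label file with one "start_sample end_sample phoneme" triple per line (the file name is hypothetical):

```python
def read_labels(path):
    """Return a list of (start, end, phoneme) segments from a label file."""
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            start, end, phone = int(parts[0]), int(parts[1]), parts[2]
            segments.append((start, end, phone))
    return segments

# Usage (hypothetical file):
# segments = read_labels("how_much_allowance.phn")
# -> [(0, 3050, "h#"), (3050, 4559, "hh"), ...]
```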
Many databases are distributed by the Linguistic Data Consortium: www.ldc.upenn.edu
Speech Signal Analysis
Feature Extraction for ASR: the aim is to extract voice features that distinguish the different phonemes of a language.
MFCC Extraction
- Pre-emphasis: to obtain similar amplitude for all formants.
- Windowing: spectral evaluation requires a stationary signal; windowing makes it possible to perform spectral evaluation over short time periods.
Pipeline: speech signal x(n) -> pre-emphasis x'(n) -> windowing x_t(n) -> DFT X_t(k) -> Mel filter banks Y_t(m) -> log(|·|²) -> IDFT -> MFCC y_t(m)
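The first two stages can be sketched in a few lines of NumPy. The pre-emphasis coefficient, frame length, and hop size below are typical values (α = 0.97, 25 ms frames with a 10 ms hop at 16 kHz), assumptions for illustration rather than parameters fixed by the slide:

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]: boosts high frequencies so that
    # all formants end up with comparable amplitude.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=400, hop=160):
    # Split the signal into overlapping short-time frames and apply a
    # Hamming window, so each frame can be treated as quasi-stationary.
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * window
```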
Spectral Analysis
- DFT: X_t(k) = X_t(e^(j2πk/N)), k = 0, …, N−1.
- Filter bank processing: obtains the spectral features of speech by properly integrating the spectrum over defined frequency ranges.
  - A set of 24 band-pass filters is generally used, since it simulates human ear processing.
  - The most widely used scale is the Mel scale.
- Up to this point, the procedure smooths the spectrum and performs processing similar to that executed by the human ear.
- log(|·|²):
  - This step, too, mirrors processing performed by the human ear.
  - Taking the magnitude discards the useless phase information.
  - The logarithm performs dynamic compression, making feature extraction less sensitive to variations in dynamics.
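Continuing the sketch, the following shows the remaining stages: a 24-filter triangular Mel filter bank and the log / inverse-transform step, implemented (as in most practical front ends) with a DCT standing in for the IDFT. The FFT size, sampling rate, and number of cepstral coefficients are illustrative assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def mel_filter_bank(n_filters=24, n_fft=512, fs=16000):
    # Triangular filters spaced uniformly on the Mel scale,
    # mel(f) = 2595 * log10(1 + f / 700).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc(frames, fb, n_ceps=13):
    # |DFT|^2 -> Mel filter bank -> log -> DCT -> first coefficients.
    n_fft = 2 * (fb.shape[1] - 1)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel_energies = power @ fb.T
    log_mel = np.log(mel_energies + 1e-10)  # dynamic-range compression
    return dct(log_mel, type=2, axis=-1, norm="ortho")[:, :n_ceps]
```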
Explanatory Example
[Figure panels: speech waveform of the phoneme "ae"; the same frame after pre-emphasis and Hamming windowing; its power spectrum; the resulting MFCCs.]
Training and Recognition
- Training: the model is built from a large number of different correspondences (X′, W′). This mirrors the way a child learns: the greater the number of pairs, the greater the recognition accuracy.
- Recognition: all possible word sequences W are tested to find the W* whose acoustic sequence X = h(W*, Θ) best matches the one given, Θ denoting the model parameters.
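In sketch form, the recognition step is a search for the best-scoring hypothesis. Here `models` and `score` are hypothetical stand-ins for the trained word models and for a function measuring how well a model explains the observed features X:

```python
def recognize(X, models, score):
    # Test every candidate word sequence W and keep the best match.
    best_w, best_s = None, float("-inf")
    for W, model in models.items():
        s = score(model, X)  # e.g., log P(X | W) + log P(W)
        if s > best_s:
            best_w, best_s = W, s
    return best_w
```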
Deterministic vs. Stochastic Framework
- Deterministic framework (DTW): one or more acoustic templates are memorized per word.
  - Drawback: this is not sufficient to represent all the variability of speech, i.e., all the X associated with a word.
- Stochastic framework: the knowledge is embedded stochastically. This allows a model such as the HMM, which takes many more correspondences (X′, W′) into account.
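For reference, a minimal dynamic time warping implementation between two feature sequences (frames × dimensions) might look like the sketch below; the Euclidean frame distance is an assumption, and a smaller total distance means a better template match:

```python
import numpy as np

def dtw_distance(a, b):
    # D[i, j] = cost of the best alignment of a[:i] with b[:j].
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```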
Implementing HMMs for Speech Modeling: Training and Recognition
The recognition procedure may be divided into two distinct stages:
- TRAINING: building HMM speech models from the correspondence between the observation sequences Y and the state sequences S.
- RECOGNITION: recognizing speech from the stored HMM models and the actual observation Y.
[Figure: a training block estimates HMMs from the extracted features Y and state sequences S of the speech samples; the recognition block maps the features Y to the output word sequence W*.]
Implementation of HMM
HMM of a simple grammar: "sil, NO, YES"
- Word-level transitions:
  - P(w_t = YES | w_{t−1} = sil) = 0.2
  - P(w_t = NO | w_{t−1} = sil) = 0.2
  - P(w_t = sil | w_{t−1} = YES) = 1
  - P(w_t = sil | w_{t−1} = NO) = 1
[Figure: composite HMM with state transitions P(s_t | s_{t−1}) over states s(0)–s(12): a silence start state s(0), a chain of states for the phonemes of YES ("YE" + "S"), and a chain for the phonemes of NO ("N" + "O"); emission probabilities such as P(Y | s_t = s(9)) link the observations Y to individual states.]
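The word-level grammar above can be written down directly as a transition table. The silence self-loop probability of 0.6 is an assumption, chosen only so that each row sums to one:

```python
# Word-level transition probabilities of the "sil, NO, YES" grammar.
word_transitions = {
    "sil": {"sil": 0.6, "YES": 0.2, "NO": 0.2},  # 0.6 assumed (row sums to 1)
    "YES": {"sil": 1.0},                          # after a word, back to silence
    "NO":  {"sil": 1.0},
}

# Sanity check: every row of the transition table is a distribution.
assert all(abs(sum(row.values()) - 1.0) < 1e-9
           for row in word_transitions.values())
```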
The Search Algorithm
[Figure: hypothesis tree of the Viterbi search algorithm. At times 1, 2, 3 the active hypotheses (states such as s(0), s(1), s(2), s(7), s(8)) are extended frame by frame, each path carrying an accumulated score (e.g., 0.1, 0.4, …, 0.025), and only the best-scoring path into each state survives.]
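A compact Viterbi decoder over such a model, working in the log domain to avoid underflow, might look like the following sketch (the matrix shapes and argument names are assumptions):

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    # log_A: (S, S) log transition probabilities between states.
    # log_B: (T, S) per-frame log emission scores log P(Y_t | s).
    # log_pi: (S,) initial log state probabilities.
    T, S = log_B.shape
    delta = np.empty((T, S))           # best path score ending in each state
    psi = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # all predecessor extensions
        psi[t] = np.argmax(scores, axis=0)       # keep the best per state
        delta[t] = scores[psi[t], np.arange(S)] + log_B[t]
    # Backtrack the single best state sequence.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```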
Conclusions
- Modern speech understanding systems merge interdisciplinary technologies from signal processing, pattern recognition, natural language processing, and linguistics into a unified statistical framework.
- Voice-commanded applications are expected to cover many aspects of our future daily life.
- Car computers, telephones, and general appliances are the most likely candidates for this revolution, which may drastically reduce the use of the keyboard.
- Speech recognition is nowadays regarded by market projections as one of the most promising technologies of the future.
- This is easily seen in industrial product sales, which rose from $500 million in 1997 to $38 billion in 2003.
