Automatic Speech Recognition
By: Khalid El-Darymli, G0327887
OUTLINE
Automatic Speech Recognition (ASR): definition, capabilities, usage, and milestones.
Structure of an ASR system: speech database, MFCC extraction, training & recognition.
Conclusions.
Multilayer structure of speech production:
Pragmatic Layer: [book_airplane_flight]
Semantic Layer: [from_locality] [to_locality] [departure_time]
Syntactic Layer: [I] [would] [like] [to] [book] [a] [flight] [from] [Rome] [to] [London] [tomorrow] [morning]
Prosodic/Phonetic Layer: [book] -> [b/uh/k]
Acoustic Layer: the speech waveform itself
What is Speech Recognition?
The process of converting an acoustic signal, captured by a microphone or telephone, into a set of words. The recognized words can be the final result, as in applications such as command and control, data entry, and document preparation. They can also serve as input to further linguistic processing in order to achieve speech understanding.
Capabilities of ASR include:
Isolated word recognizers: for segments separated by pauses.
Word spotting: algorithms that detect occurrences of key words in continuous speech.
Connected word recognizers: identify uninterrupted, but strictly formatted, sequences of words (e.g. recognition of telephone numbers).
Restricted speech understanding: systems that handle sentences relevant to a specific task.
Task-independent continuous speech recognizers: the ultimate goal in this field.
Two types of systems:
Speaker-dependent: the user must provide samples of his/her speech before using the system.
Speaker-independent: no speaker enrollment is necessary.
Uses and Applications
Dictation: includes medical transcription, legal and business dictation, as well as general word processing.
Command and Control: ASR systems designed to perform functions and actions on the system.
Telephony: some voice mail systems allow callers to speak commands instead of pressing buttons to send specific tones.
Wearables: because inputs are limited for wearable devices, speaking is a natural possibility.
Medical/Disabilities: many people have difficulty typing due to physical limitations such as repetitive strain injuries (RSI), muscular dystrophy, and many others. For example, people with difficulty hearing could use a system connected to their telephone to convert the caller's speech to text.
Embedded Applications: some newer cellular phones include command-and-control speech recognition that allows utterances such as "Call Home".
A Timeline & History of Voice Recognition Software
1995: Dragon released discrete-word, dictation-level speech recognition software. It was the first time dictation speech & voice recognition technology was available to consumers.
1984: SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.
1982: Dragon Systems was founded.
1971: DARPA established the Speech Understanding Research (SUR) program, with $3 million per year of government funding for 5 years. It was the largest speech recognition project ever.
Early 1970's: The HMM approach to speech & voice recognition was invented by Lenny Baum of Princeton University.
1936: AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder.
…timeline…continued
2003: ScanSoft, Inc. is presently the world leader in speech recognition technology in the commercial market. ScanSoft ships Dragon NaturallySpeaking 7 Medical, lowering healthcare costs through highly accurate speech recognition.
2000: Lernout & Hauspie acquired Dragon Systems for approximately $460 million.
1998: Microsoft invested $45 million so that it could use speech & voice recognition technology in its systems.
1997: Dragon introduced "NaturallySpeaking", the first "continuous speech" dictation software available.
The Structure of an ASR System
[Figure: functional scheme of an ASR system. Speech samples (X) pass through a Signal Interface and Feature Extraction to produce the observation sequence Y; the Recognition block converts Y into the recognized word sequence W*. A Training block estimates the HMM models (over state sequences S) from the speech and transcription databases, and the recognizer uses these models together with the databases.]
Speech Database:
A speech database is a collection of recorded speech accessible on a computer and supported with the necessary annotations and transcriptions. The databases collect the observations required for parameter estimation. The corpora have to be large enough to cover the variability of speech.
Transcription of speech:
Transcription is linguistic information associated with digital recordings of acoustic signals. This symbolic representation of speech is used to easily retrieve the content of the databases. Transcription involves segmentation and labeling.
Example: the graph below shows an acoustic waveform for the sentence "how much allowance". The speech data are segmented and labeled with the phoneme string "/h# hh aw m ah tcl ch ax l aw ax n s/".
[Figure: segmentation and labeling example]
Many databases are distributed by the Linguistic Data Consortium (www.ldc.upenn.edu).
Speech Signal Analysis
Feature Extraction for ASR: the aim is to extract voice features that distinguish the different phonemes of a language.
MFCC extraction:
Pre-emphasis: to obtain similar amplitudes for all formants.
Windowing: spectral evaluation requires a stationary signal, so the signal is split into short windows over which it can be treated as stationary and spectral evaluation becomes possible.
Processing chain: speech signal x(n) → Pre-emphasis → x'(n) → Windowing → x_t(n) → DFT → X_t(k) → Mel filter banks → Y_t(m) → Log(|·|²) → IDFT → MFCC y_t(m).
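A minimal sketch of the first two stages (pre-emphasis and windowing). The 16 kHz sampling rate, 0.97 pre-emphasis coefficient, and 25 ms / 10 ms frame sizes are common choices assumed here, not values taken from the slides:

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """Pre-emphasis filter y[n] = x[n] - alpha * x[n-1] (boosts the higher formants)."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=400, frame_shift=160):
    """Split the signal into overlapping frames and apply a Hamming window.
    With 16 kHz audio, 400/160 samples correspond to 25 ms frames every 10 ms."""
    n_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    return np.stack([
        x[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])

# Usage on a synthetic tone (stand-in for real speech samples):
x = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)  # 1 s of a 300 Hz tone
frames = frame_and_window(preemphasize(x))
print(frames.shape)  # (n_frames, 400)
```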
Spectral Analysis:
DFT: X_t(k) = X_t(e^(j2πk/N)), k = 0, …, N−1.
Filter bank processing: to obtain the spectral features of speech by properly integrating the spectrum over defined frequency ranges. A set of 24 band-pass filters (BPF) is generally used, since it simulates human ear processing. The most widely used scale is the Mel scale.
Up to this point, the procedure smooths the spectrum and performs processing similar to that carried out by the human ear.
Log(|·|²): this processing is also performed by the human ear. The magnitude discards the useless phase information, while the logarithm performs a dynamic compression that makes feature extraction less sensitive to variations in dynamics.
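A sketch of the remaining stages (power spectrum, 24-filter Mel bank, log, and the final cepstral transform). The slide's "IDFT" stage is implemented here as the DCT commonly used in practice, and the 512-point FFT and 13 cepstral coefficients are assumed, not specified by the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sr=16000, f_low=0.0, f_high=8000.0):
    """Triangular filters equally spaced on the Mel scale (24 BPF as on the slide)."""
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_from_frames(frames, n_fft=512, n_filters=24, n_ceps=13):
    """Power spectrum -> Mel filter bank -> log -> DCT (cepstral stage)."""
    spectrum = np.abs(np.fft.rfft(frames, n_fft)) ** 2           # |DFT|^2 per frame
    fbank_energies = spectrum @ mel_filterbank(n_filters, n_fft).T
    log_energies = np.log(fbank_energies + 1e-10)                # dynamic compression
    m = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (m[:, None] + 0.5) * np.arange(n_ceps)[None, :])
    return log_energies @ dct                                    # (n_frames, n_ceps)
```

Applied to the `frames` array from the previous sketch, `mfcc_from_frames(frames)` yields one 13-dimensional MFCC vector per frame.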
Explanatory Example
[Figure panels: speech waveform of the phoneme /ae/; after pre-emphasis and Hamming windowing; power spectrum; MFCC.]
Training and Recognition:
Training: the model is built from a large number of different correspondences (X', W'). This is analogous to the way a baby learns. The greater the number of such pairs, the greater the recognition accuracy.
Recognition: all possible word sequences W are tested to find the W* whose acoustic sequence X = h(W*, λ) best matches the given one, where λ denotes the trained models.
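A purely conceptual sketch of this recognition rule: enumerate candidate word sequences and keep the best-matching one. The vocabulary and the `match_score` function are hypothetical stand-ins; practical decoders use the Viterbi-style search shown later rather than exhaustive enumeration:

```python
import itertools

def best_word_sequence(match_score, vocabulary, max_len=3):
    """Exhaustively test every word sequence W (up to max_len words)
    and return the W* that best matches the observed acoustics."""
    best, best_score = None, float("-inf")
    for length in range(1, max_len + 1):
        for W in itertools.product(vocabulary, repeat=length):
            score = match_score(W)  # hypothetical acoustic match score
            if score > best_score:
                best, best_score = W, score
    return best

# Toy usage with a made-up scorer that prefers the sequence ("yes",):
W_star = best_word_sequence(lambda W: 1.0 if W == ("yes",) else 0.0, ["yes", "no"])
print(W_star)  # ('yes',)
```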
Deterministic vs. Stochastic framework:
Deterministic framework (DTW, Dynamic Time Warping): one or more acoustic templates are memorized per word. Drawback: this is not sufficient to represent all of the speech variability, i.e. all the X associated with a word.
Stochastic framework: the knowledge is embedded stochastically. This allows us to consider a model such as the HMM, which takes more correspondences (X', W') into account.
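A minimal DTW sketch for the deterministic framework. The toy templates, the Euclidean frame cost, and the feature shapes are illustrative assumptions; in practice the frames would be MFCC vectors like those computed above:

```python
import numpy as np

def dtw_distance(template, observation):
    """Dynamic time warping: cost of the best monotonic alignment between a
    stored template and an observed feature sequence (both (n_frames, n_features))."""
    T, O = len(template), len(observation)
    D = np.full((T + 1, O + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, O + 1):
            cost = np.linalg.norm(template[i - 1] - observation[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, O]

def recognize(observation, templates):
    """Pick the word whose template aligns to the observation at least cost."""
    return min(templates, key=lambda w: dtw_distance(templates[w], observation))

# Toy usage with made-up 1-D "feature" sequences:
templates = {"yes": np.array([[0.0], [1.0], [2.0]]), "no": np.array([[2.0], [1.0], [0.0]])}
obs = np.array([[0.1], [0.9], [2.1], [2.0]])
print(recognize(obs, templates))  # "yes"
```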
Implementing HMMs for Speech Modeling: Training and Recognition
The recognition procedure may be divided into two distinct stages:
- Building HMM speech models λ based on the correspondence between the observation sequences Y and the state sequences S (TRAINING).
- Recognizing speech by means of the stored HMM models λ and the actual observation Y (RECOGNITION).
[Figure: speech samples → Feature Extraction → Y; Training estimates the HMM models; Recognition combines Y with the models to output W*.]
Implementation of HMM: HMM of a simple grammar: { \sil, YES, NO }
Word-level transition probabilities:
P(w_t = yes | w_{t-1} = \sil) = 0.2
P(w_t = no | w_{t-1} = \sil) = 0.2
P(w_t = \sil | w_{t-1} = yes) = 1
P(w_t = \sil | w_{t-1} = no) = 1
[Figure: state-transition diagram with probabilities P(s_t | s_{t-1}); states s(0) (silence/start) through s(12), grouped into the phonemes of YES ('YE', 'S') and NO ('N', 'O'); emission probabilities P(Y | s_t = s(9)), with 0.6 shown in the figure.]
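A small sketch of this word-level grammar as a transition table. The \sil self-loop of 0.6 is an assumption added so that each row sums to one; it is not stated explicitly on the slide:

```python
# Word-level transition structure of the \sil / YES / NO grammar.
transitions = {
    "sil": {"sil": 0.6, "yes": 0.2, "no": 0.2},  # 0.6 self-loop is assumed
    "yes": {"sil": 1.0},
    "no":  {"sil": 1.0},
}

# Sanity check: every row of the transition table is a probability distribution.
for state, row in transitions.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, state
print("grammar transition table OK")
```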
The Search Algorithm:
[Figure: hypothesis tree of the Viterbi search algorithm. Starting from s(0), hypotheses are expanded over time (Time = 1, 2, 3) into states such as s(1), s(7), s(8), each path carrying an accumulated score (e.g. 0.1, 0.4, 0.051, 0.045, 0.036, 0.032); the best-scoring path is retained.]
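A minimal Viterbi decoding sketch corresponding to this search. The 2-state, 2-symbol model and its probabilities are toy values for illustration, not the numbers in the figure:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for a discrete observation sequence.
    pi: initial state probs (S,), A: transitions (S, S), B: emissions (S, V)."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))            # best path probability ending in each state
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] * A[:, s]
            psi[t, s] = np.argmax(scores)
            delta[t, s] = scores[psi[t, s]] * B[s, obs[t]]
    # Backtrack the best hypothesis, as in the hypothesis tree above.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path))

# Toy 2-state, 2-symbol model:
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))  # [0, 1, 1]
```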
Conclusions:
Modern speech understanding systems merge interdisciplinary technologies from signal processing, pattern recognition, natural language, and linguistics into a unified statistical framework.
Voice-commanded applications are expected to cover many aspects of our future daily life. Car computers, telephones, and general appliances are the most likely candidates for this revolution, which may drastically reduce the use of the keyboard.
Speech recognition is nowadays regarded by market projections as one of the most promising technologies of the future. That is easily realized by taking a look at industrial product sales, which rose from $500 million in 1997 to $38 billion in 2003.
