SPEECH USER INTERFACE
Husain Firoz Master
(302093)
Guided by Prof. Vidya Patil
Outline
 Introduction
 Need for SUI
 Expectations from SUI
 Overview of speech recognition
 Voice feature extraction and its techniques
 Implementation of SUI
 Applications
 Future scope
 Shortcomings
 Conclusion
SPEECH USER INTERFACE (SUI)
 A user interface that works with human voice commands
 It offers truly hands-free, eyes-free interaction with computers
 It provides an interface for operating computers, designed around the following considerations:
 Technology support
 User category
 User support
NEED FOR A SPEECH USER INTERFACE
 It offers truly hands-free, eyes-free interaction
 It offers high input throughput: speaking is typically faster than typing on a keyboard
 It is the only plausible interaction modality for illiterate users across the world
 It presents opportunities for illiterate users in developing regions, giving them a feasible way to access computing
 However, SUIs are not yet developed widely enough to support every type of user, language, or acoustic scenario
Expectations from speech user interface
(SUI)
 Recognize speech from any untrained user
 Understand the meaning of the spoken words
 Act according to the extracted meaning
 Deal with multiple languages
 Support large vocabularies
 Provide a good level of fault tolerance
 Provide help and messages to users during interaction
 Operate in real-time
Speech Recognition
 Translation of spoken words into text
 The ability of machines to understand natural human spoken language
 Two types of speech recognition: speaker-dependent and speaker-independent
 Speaker-dependent: systems that require training on each user's voice
 Speaker-independent: systems that do not require per-user training
 Basically, it is the process by which a computer maps an acoustic speech signal to text
SPEECH RECOGNITION MODEL
VOICE FEATURE EXTRACTION
 Voice feature extraction is also known as front-end processing
 It is performed in both recognition and training modes
 It converts the digital speech signal into sets of numerical descriptors called feature vectors
 Feature vectors contain the key characteristics of the speech signal
 Different types of features extracted from the voice are evaluated to determine their suitability for voice recognition
 MFCC is among the most widely used feature extraction techniques; HMMs are typically applied afterwards to model the extracted features
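Before any features are computed, front-end processing cuts the continuous signal into short, overlapping, windowed frames. A minimal NumPy sketch of this step (the 25 ms frame / 10 ms hop at 16 kHz are common defaults assumed here, not values given in the slides):

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping Hamming-windowed frames
    (400 samples / 160-sample hop = 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
frames = frame_signal(np.sin(2 * np.pi * 440 * t))
print(frames.shape)  # (98, 400): 98 frames, each a 400-sample window
```

Each row of `frames` then becomes one feature vector after the steps described on the following slides.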
Feature Extraction Techniques
 Mel-frequency Cepstral coefficients (MFCC)
 Hidden Markov model (HMM)
 Dynamic time warping (DTW)
 Fusion of HMM and DTW
9
Mel-frequency Cepstral coefficients (MFCC)
 The mel-frequency cepstrum (MFC) is a representation of the short-term power
spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
 Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC.
 The frequency bands are equally spaced on the mel scale, so the MFC approximates the human
auditory system's response more closely than linearly spaced bands would.
 Using about 20 MFCC coefficients is common in ASR, although 10-12
coefficients are often considered sufficient for coding speech.
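The full MFCC chain (power spectrum → mel filterbank → log → DCT-II) can be sketched in plain NumPy. The filter count (26) and coefficient count (12) below are common defaults assumed for illustration:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale (not in Hz)."""
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fb[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc(frames, sr, n_filters=26, n_ceps=12):
    """Per frame: power spectrum -> mel filterbank -> log -> DCT-II."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    logmel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    k = np.arange(n_ceps)[:, None]            # one DCT-II basis row per coefficient
    n = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return logmel @ basis.T                   # shape: (n_frames, n_ceps)
```

Feeding it windowed frames of shape (n_frames, 400) at 16 kHz yields a (n_frames, 12) matrix of feature vectors, one row per frame.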
Hidden Markov model (HMM)
 HMMs are used to represent the possible symbol sequences underlying
speech utterances.
 The states in an HMM represent basic linguistic units (e.g. phonemes
or sub-phoneme segments) that humans use to pronounce a word.
 For each word, one or more composite HMMs exist, modeling the probability of
the state sequences that realize that word.
 HMMs are usually trained from large sets of recorded and feature-analyzed
samples.
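A toy example of the probability an HMM assigns to an utterance: the forward algorithm sums, over all state paths, the probability that a left-to-right word model emits an observed symbol sequence. All numbers below are illustrative, not taken from the slides:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | HMM), summed over state paths.
    pi: initial state probs (S,), A: transition matrix (S,S), B: emission probs (S,V)."""
    alpha = pi * B[:, obs[0]]          # probability of starting in each state and emitting obs[0]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate along transitions, then emit
    return alpha.sum()

# Toy 2-state word model over 3 discrete acoustic symbols
pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])             # left-to-right topology, typical for speech HMMs
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
print(forward(pi, A, B, [0, 1, 2]))    # 0.0756
```

In recognition, one such model is evaluated per word and the word whose model gives the highest likelihood wins.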
Dynamic time warping (DTW)
 Dynamic time warping (DTW) is an algorithm for measuring similarity between two
spoken word sequences that may vary in time or speed.
 It aligns two speech sequences and computes the distance between them under the best alignment.
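The alignment idea can be shown with the classic dynamic-programming recurrence: the same curve sampled at two different "speaking rates" still yields a small DTW distance. The sine wave standing in for speech feature frames is purely illustrative:

```python
import numpy as np

def dtw_distance(x, y):
    """Minimum cumulative frame distance over all monotonic alignments of x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The same "word" spoken quickly (20 frames) vs. slowly (35 frames)
fast = np.sin(np.linspace(0, np.pi, 20))[:, None]
slow = np.sin(np.linspace(0, np.pi, 35))[:, None]
silence = np.zeros((35, 1))
print(dtw_distance(fast, slow) < dtw_distance(fast, silence))  # True
```

Despite the different lengths, the warped distance between the two renditions stays far below the distance to silence, which is what makes DTW usable for template matching of isolated words.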
Fusion HMM and DTW
 In this method, HMM and DTW are combined
 The scores produced by HMM and DTW are combined as a weighted mean
 DTW finds the similarity between two signals based on their time alignment
 Meanwhile, the HMM is trained on clusters of observations, iteratively moving between
clusters based on the likelihoods seen during training.
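One plausible reading of "combined as a weighted mean": normalize each model's per-word scores so that higher is better, then mix them with a weight. This sketch is an assumption about what the slide intends; the weight `w`, the min-max normalization, and all numbers are illustrative choices:

```python
import numpy as np

def fuse_scores(hmm_scores, dtw_dists, w=0.5):
    """Illustrative score fusion (assumed scheme, not from the slides):
    HMM log-likelihoods (higher = better) and DTW distances (lower = better)
    are min-max normalized to [0, 1] and mixed as a weighted mean."""
    h = np.asarray(hmm_scores, float)
    d = -np.asarray(dtw_dists, float)  # negate so that higher = better for both
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
    return w * norm(h) + (1 - w) * norm(d)

# Three candidate words scored by both models; pick the best fused candidate
fused = fuse_scores([-12.1, -9.8, -15.3], [4.2, 1.1, 6.0])
print(int(np.argmax(fused)))  # 1: word index 1 wins under both models
```

The normalization step matters because raw log-likelihoods and raw distances live on incompatible scales; without it the weighted mean would be dominated by whichever model produces larger magnitudes.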
SUI IMPLEMENTATION
SUI IMPLEMENTATION (contd.)
 Recording speech
 Applying noise cancellation
 Endpoint detection
 Feature extraction: the MFCC algorithm is used, and the extracted parameters are
kept for the training stage.
 Normalization: the word length is calculated for all groups and averaged for
each.
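Of these steps, endpoint detection is the simplest to sketch: trim leading and trailing silence by thresholding short-term frame energy. The frame size and the 10% threshold below are assumed values, not taken from the slides:

```python
import numpy as np

def endpoints(signal, frame=160, thresh_ratio=0.1):
    """Energy-based endpoint detection: return the sample range spanned by
    frames whose energy exceeds a fraction of the peak frame energy."""
    n = len(signal) // frame
    energy = np.array([np.sum(signal[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    active = np.where(energy > thresh_ratio * energy.max())[0]
    return active[0] * frame, (active[-1] + 1) * frame

# Silence, a burst of "speech" (white noise stands in for it), then silence
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(800), rng.normal(size=1600), np.zeros(800)])
start, end = endpoints(sig)
print(start, end)  # 800 2400: only the noisy middle section is kept
```

Everything outside `[start, end)` is discarded before feature extraction, which both reduces computation and keeps silence from being modeled as part of the word.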
SUI IMPLEMENTATION (contd.)
 Training using HMM
 Fusion of HMM and DTW
 The recognized word or sentence is passed to the application
Applications
 Speech-operated calculator
 Voice dialing
 Intelligent voice assistants (personal agents) such as Apple Siri and Google Voice Talk
 Home and building automation systems using an SUI
 Live subtitling on television
 Speech-to-text conversion and note-taking systems
Future Scope
 Speech user interfaces for learning foreign languages
 Dictation tools for the medical and legal professions
SHORTCOMINGS
 These design guidelines help work around the current shortcomings of speech recognition:
 Train the speech recognition system in the environment where it will be deployed
 Keep the vocabulary size small
 Keep each speech input (word length) short
 Use speech inputs that sound distinctly different from each other
 Keep the user interface simple
 Don't use speech to position objects
 Use a command-based user interface
 Allow users to quickly and easily turn the speech recognizer off and on
 Use a highly directional, noise-canceling microphone
CONCLUSION
