SEQUENCE TO SEQUENCE MODEL IN
SPEECH RECOGNITION
ATTENTION BASED MODEL
By:
Aditya Kumar Khare
B.Tech (Computer Science)
1402910012
INTRODUCTION
 Speech recognition technology has recently reached a higher level of performance and robustness, allowing users to communicate with devices simply by talking.
 Speech recognition is the process of decoding an acoustic speech signal, captured by a microphone or telephone, into a set of words.
 With these words in hand, the whole utterance is recognized word by word.
TYPES OF SPEECH RECOGNITION
 Speaker-independent models recognize the speech patterns of a large group of people.
 Speaker-dependent models recognize speech patterns from only one person.
Both models use mathematical and statistical formulas to yield the best word match for speech. A third variation of speaker models is now emerging, called speaker-adaptive.
 Speaker-adaptive systems usually begin with a speaker-independent model and adjust it more closely to each individual during a brief training period.
WHY DO WE NEED TO IMPROVE SPEECH
RECOGNITION
• Speech is the most natural form of communication and allows us to build an interface that’s much more intuitive than a passive one.
• It makes the interaction between us and our devices more lively and effective.
• We are always targeting productivity, and better speech recognition improves our performance in daily life many times over.
• It allows the next billion people, who still cannot interact with these devices or understand how they work, to adopt the technology.
HOW TRADITIONAL SPEECH RECOGNITION WORKS
(ASR)
• Also known as automatic speech recognition (ASR), this is the traditional system for recognizing speech.
• It works in five basic steps (a toy end-to-end sketch follows after this list).
• Step 1: User Input
• The system captures the user’s voice in the form of an analog acoustic signal.
• Step 2: Digitization
• The analog acoustic signal is digitized.
• Step 3: Phonetic Breakdown
• The signal is broken into phonemes.
• Phonemes: perceptually distinct units of sound in a given language that distinguish one word from another, e.g. the /k/ sound shared by cat, kit, scat, and skit.
• Step 4: Statistical Modeling
• Phonemes are mapped to their phonetic representations by a statistical model.
• Step 5: Matching
• According to the grammar, phonetic representation, and dictionary, the system returns an n-best list (i.e. candidate words, each with a confidence score).
• Grammar: the set of words or phrases used to constrain the range of input or output in the voice application.
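To make the five steps concrete, here is a toy, self-contained Python sketch of the pipeline. Everything in it (the fake signal, the two-word lexicon, the overlap score) is an illustrative stand-in, not how a production recognizer actually scores words.

import numpy as np

def digitize(analog, bits=16):                           # Step 2: quantization
    return np.round(np.asarray(analog) * (2**(bits - 1) - 1)).astype(np.int16)

def phonetic_breakdown(samples):                         # Step 3: segmentation
    # A real system would run feature extraction + an acoustic model here;
    # we simply pretend we heard /k/ /ae/ /t/.
    return ["k", "ae", "t"]

LEXICON = {"cat": ["k", "ae", "t"], "kit": ["k", "ih", "t"]}  # Step 5 dictionary

def match(phonemes):                                     # Steps 4-5: score each word
    n_best = []
    for word, pron in LEXICON.items():
        overlap = len(set(phonemes) & set(pron)) / len(pron)
        n_best.append((word, round(overlap, 2)))         # (word, confidence score)
    return sorted(n_best, key=lambda p: -p[1])

analog = np.sin(np.linspace(0.0, 6.28, 16000))           # Step 1: stand-in user input
samples = digitize(analog)
print(match(phonetic_breakdown(samples)))                # [('cat', 1.0), ('kit', 0.67)]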
AUTOMATIC SPEECH RECOGNITION
• Acoustic Model
• Pronunciation Model
• Language Model
ACOUSTIC MODEL
• Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature-vector sequences computed from the speech waveform.
• The hidden Markov model (HMM) is one of the most common types of acoustic model.
• The task of this model is to represent the relationship between an audio signal and its phonemes.
• In short, it represents sound as numbers (see the feature-extraction sketch below).
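As a concrete illustration of "sound as numbers", the sketch below computes MFCC feature vectors, a common input to acoustic models. It assumes the librosa library is installed; the file name speech.wav is a hypothetical placeholder.

# Computing the feature-vector sequence an acoustic model consumes.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)         # waveform as float samples
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, num_frames)
print(mfcc.shape)  # one 13-dim feature vector per analysis frame

# An HMM-based acoustic model would then estimate P(feature vector | phoneme state)
# for each frame, e.g. with a Gaussian mixture per hidden state.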
PRONUNCIATION MODEL
• These models guide the system in detecting pronunciation variations and predicting the right spoken word.
• They help us handle the variation in pronunciation across accents that we face in daily life.
• Example: "technology" is pronounced differently in British and American accents.
• A pronunciation model describes how a sequence (or multiple sequences) of fundamental speech units, such as phones or phonetic features, is used to represent larger speech units such as words or phrases (see the toy lexicon below).
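In its simplest form, a pronunciation model can be pictured as a lexicon mapping words to phone sequences. The entries below are illustrative ARPAbet-style guesses (including an accent variant for "technology"), not transcriptions from a real dictionary.

# Toy pronunciation lexicon: each word maps to one or more phone sequences.
LEXICON = {
    "technology": [
        ["t", "eh", "k", "n", "aa", "l", "ah", "jh", "iy"],  # American-style variant
        ["t", "eh", "k", "n", "ao", "l", "ah", "jh", "iy"],  # British-style variant
    ],
    "cat": [["k", "ae", "t"]],
}

def pronunciations(word):
    """Return all modeled pronunciation variants for a word."""
    return LEXICON.get(word.lower(), [])

for variant in pronunciations("technology"):
    print(" ".join(variant))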
LANGUAGE MODEL
• It provides the context needed to distinguish between words and phrases that sound similar.
• It is the model that deals with the probabilities of word sequences, so that words can be predicted from the given sound waveform.
• It is a probability distribution over sequences of words that makes the prediction algorithm better (see the bigram sketch below).
• It constrains the search to a limited region of likely word sequences when decoding the waveform.
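A minimal illustration: the bigram model below assigns a higher probability to a word sequence seen in its toy corpus than to an unseen but acoustically plausible one. The corpus and add-one smoothing are deliberately simplistic.

# Minimal bigram language model over a toy corpus.
from collections import Counter

corpus = "we want to recognize speech and we want to recognize words".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(word | prev) with add-one smoothing over the toy vocabulary."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

print(p_next("speech", "recognize"))  # higher: observed in the corpus
print(p_next("beach", "recognize"))   # lower: never observed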
WHY WE NEED BETTER?
• The traditional method is divided into three different models that are interdependent, which makes it hard for the community to experiment with.
• The newly suggested approaches subsume all three models into one complete model that learns the interlinking on its own.
• This is being done to avoid the human errors that creep into hand-built components and to use machine learning capabilities to build better systems.
SEQUENCE TO SEQUENCE MODEL
• A sequence-to-sequence model is about training models to convert sequences from one domain into sequences of another domain. Whenever you need to generate text from audio, it is a natural fit (a minimal skeleton follows after this list).
• For example, given a sequence of frames from a video, we may want to analyze the action being performed. By dividing the video into frames and looking at the changes between them, the model can identify the actions performed.
• It can likewise be used to translate from one language to another, since these are all kinds of sequences.
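A minimal encoder-decoder skeleton, assuming PyTorch, gives a feel for the idea: one network encodes the input sequence, another decodes the output sequence starting from the encoder's state. All dimensions and the vocabulary size are arbitrary illustrative choices.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=64):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, audio_feats, prev_tokens):
        # Encode the source sequence (e.g. audio frames) into hidden states.
        _, state = self.encoder(audio_feats)
        # Decode the target sequence (e.g. characters), seeded by the encoder state.
        dec_out, _ = self.decoder(self.embed(prev_tokens), state)
        return self.out(dec_out)  # logits over the output vocabulary per step

model = Seq2Seq()
feats = torch.randn(2, 100, 80)          # batch of 2 utterances, 100 frames each
tokens = torch.randint(0, 64, (2, 12))   # previous output tokens (teacher forcing)
print(model(feats, tokens).shape)        # torch.Size([2, 12, 64])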
LISTEN, ATTEND AND SPELL MODEL
• Listen, Attend and Spell (LAS) is an attention-based sequence-to-sequence model that subsumes all the traditional models into one.
• Attention-based models rest on two key ideas:
1. Passing the information from the encoder to the decoder.
2. Passing the context that tells the model where in the speech to pay attention.
• The model includes an Encoder, which is analogous to the traditional acoustic model.
• An Attender, which acts as an alignment model.
• And lastly a Decoder, which is analogous to the language model in a conventional system.
• These changes were made to minimize the minimum word error rate (MWER).
• Listener (Encoder):
Similar to the acoustic model, it takes the input features and maps them to a higher-level feature representation.
• Attender:
• Determines which encoder features should be attended to in order to predict the next output symbol (a sketch follows after this list).
• Decoder:
• Takes the attention context generated by the attender, as well as the embedding of the previous prediction, to produce a probability distribution over output symbols.
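The attender step can be sketched as follows, assuming PyTorch. For simplicity this uses dot-product attention; LAS itself learns its attention energies, so treat this as a simplified stand-in rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def attend(dec_state, enc_feats):
    """dec_state: (batch, hidden); enc_feats: (batch, time, hidden)."""
    energies = torch.bmm(enc_feats, dec_state.unsqueeze(2)).squeeze(2)  # (batch, time)
    weights = F.softmax(energies, dim=1)                 # where to pay attention
    context = torch.bmm(weights.unsqueeze(1), enc_feats).squeeze(1)     # (batch, hidden)
    return context, weights

enc = torch.randn(2, 100, 256)   # listener output: 100 encoded frames
dec = torch.randn(2, 256)        # current decoder state
context, weights = attend(dec, enc)
print(context.shape, weights.shape)  # torch.Size([2, 256]) torch.Size([2, 100])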
IMPROVEMENTS IN THE MODEL
• Structural Improvements
1. Word piece model
2. Multi headed attention
• Optimization Improvements
1. Minimum error rate training
2. Scheduled Sampling
• Word Piece Model:
• The traditional use of graphemes as output units in sequence-to-sequence models helps fold the AM, PM, and LM into one neural network.
• Using separate phonemes, which require additional PM and LM components, was not found to improve accuracy over graphemes.
• The much lower perplexity of a word-piece model allows for a stronger decoder language model.
• Modeling longer units (word pieces) improves the effective memory of the decoder (a long short-term memory network), and allows the model to potentially memorize pronunciations of frequently occurring words.
• Words are partitioned deterministically and independently of context, using a greedy algorithm (see the segmentation sketch below), so that only the words themselves are produced rather than spurious alternatives from false predictions.
• Longer units also require fewer decoding steps, which speeds up inference in these models significantly.
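The deterministic, context-independent segmentation can be sketched with a greedy longest-match loop over a toy vocabulary. The vocabulary and the "##" continuation-piece convention here are illustrative choices, not the exact scheme used in the paper.

VOCAB = {"speech", "recog", "##nition", "##s", "spe", "##ech", "re", "##cog"}

def wordpieces(word):
    """Greedy longest-match word-piece segmentation."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):          # try longest match first
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            return ["<unk>"]                             # no piece matched
    return pieces

print(wordpieces("speech"))       # ['speech']  -> a single decoding step
print(wordpieces("recognition"))  # ['recog', '##nition']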
• Multi-Headed Attention:
• MHA extends the conventional attention mechanism to have multiple heads, where each head can generate a different attention distribution.
• This allows different heads to attend to the encoder outputs differently, with each head playing its own individual role (see the sketch below).
• In a single-headed architecture, by contrast, the encoder alone must provide clear signals about the utterance so that the decoder can pick up the information through attention.
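A quick sketch using PyTorch's built-in multi-head attention module (PyTorch 1.11+ for per-head weights) shows each head producing its own distribution over the encoder frames; all dimensions are illustrative.

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
enc = torch.randn(2, 100, 256)   # encoder outputs: (batch, time, dim)
query = torch.randn(2, 1, 256)   # one decoder step as the query

context, weights = mha(query, enc, enc, average_attn_weights=False)
print(context.shape)   # torch.Size([2, 1, 256])
print(weights.shape)   # torch.Size([2, 4, 1, 100]): one distribution per head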
OPTIMIZATION MODEL
• Minimum Word Error Rate Training:
• The loss function we normally optimize for an attention-based system is a sequence-level loss, not the word error rate itself.
• The strategy here is instead to minimize the expected number of word errors, as the sketch below illustrates.
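A simplified sketch of the idea, assuming PyTorch: renormalize the model's scores over an n-best list of hypotheses and minimize the expected, baseline-subtracted number of word errors. The scores and error counts below are made up for illustration.

import torch

log_probs = torch.tensor([-1.0, -1.5, -3.0])   # model scores for 3 hypotheses
word_errors = torch.tensor([0.0, 2.0, 5.0])    # word errors vs. the reference

p = torch.softmax(log_probs, dim=0)            # P(hypothesis | audio), renormalized
baseline = word_errors.mean()                  # variance-reducing baseline
mwer_loss = (p * (word_errors - baseline)).sum()
print(mwer_loss)  # gradient pushes probability toward low-error hypotheses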
• Scheduled Sampling:
• Feeding the ground-truth label as the previous prediction (so-called teacher forcing) helps the decoder learn quickly at the beginning, but it introduces a mismatch between training and inference.
• Scheduled sampling, on the other hand, samples from the probability distribution of the previous prediction and then feeds the resulting token as the previous token when predicting the next label, as sketched below.
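A one-step sketch of scheduled sampling, assuming PyTorch: with some probability, feed a token sampled from the model's previous output distribution instead of the ground truth. The sampling probability is typically ramped up over the course of training.

import torch

def next_input(ground_truth, prev_logits, sample_prob=0.25):
    """ground_truth: (batch,) token ids; prev_logits: (batch, vocab)."""
    sampled = torch.multinomial(torch.softmax(prev_logits, dim=1), 1).squeeze(1)
    use_sample = torch.rand(ground_truth.shape) < sample_prob
    return torch.where(use_sample, sampled, ground_truth)

gt = torch.tensor([3, 7, 1])
logits = torch.randn(3, 10)
print(next_input(gt, logits))  # mostly ground truth, occasionally a model sample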
CONCLUSION
• Structural Improvements:
• The word-piece model performs slightly better than graphemes, giving roughly a 2% relative improvement in word error rate (WER). The remaining improvements are built on top of the combined MHA + WPM model.
• Optimization Improvements:
• Synchronous training on top of the WPM + MHA model provides a further 3.8% improvement. Overall, the optimizations give around a 22.5% relative improvement, moving the WER from 8.0% to 6.2%.
FINAL OUTPUT
• The sequence-to-sequence model gives an 11% relative improvement in WER, but the unidirectional LAS system has some limitations.
• The entire utterance must be seen by the encoder before any labels can be decoded.
• So, to avoid having to look at the whole utterance at once, we need online algorithms that support a streaming attention-based model.
