SEQUENCE TO SEQUENCE MODEL IN
SPEECH RECOGNITION
ATTENTION BASED MODEL
By:
Aditya Kumar Khare
B.Tech (Computer Science)
1402910012
INTRODUCTION
 Speech recognition technology has recently reached a higher level of
performance and robustness, allowing users to communicate with a device
simply by talking.
 Speech recognition is the process of decoding an acoustic speech signal,
captured by a microphone or telephone, into a set of words.
 With this, the whole speech is recognized word by word.
TYPES OF SPEECH RECOGNITION
 Speaker independent models recognize the speech patterns of a large group
of people.
 Speaker dependent models recognize speech patterns from only one person.
Both models use mathematical and statistical formulas to yield the best word
match for speech. A third variation of speaker models is now emerging, called
speaker adaptive.
 Speaker adaptive systems usually begin with a speaker independent model
and adjust it more closely to each individual during a brief training period.
WHY DO WE NEED TO IMPROVE SPEECH
RECOGNITION
• Speech is the most natural form of communication and allows us to build an
interface that is much more intuitive than a passive one.
• It makes the interaction between us and our devices more alive and effective.
• We are always targeting productivity, and better recognition improves our
performance in daily life many times over.
• It allows the next billion people to adopt the technology who are currently
unable to interact with these devices or understand how they work.
HOW TRADITIONAL SPEECH RECOGNITION WORKS
(ASR)
• Also known as automatic speech recognition, this is the traditional system for
recognizing speech.
• It works in five basic steps.
• Step 1: User Input
• The system captures the user’s voice as an analog acoustic signal.
• Step 2: Digitization
• Digitize the analog acoustic signal.
• Step 3: Phonetic Breakdown
• Break the signal into phonemes.
• Phonemes: perceptually distinct units of sound in a given language. Example: the
/k/ sound in cat, kit, scat, and skit is perceived as the same phoneme.
• Step 4: Statistical Modeling
• Map phonemes to their phonetic representations in a statistical model.
• Step 5: Matching
• Using the grammar, phonetic representations, and dictionary, the system returns an n-best
list (i.e. words plus confidence scores).
• Grammar: the set of words or phrases used to constrain the range of input or output in the
voice application.
AUTOMATIC SPEECH RECOGNITION
• Acoustic Model
• Pronunciation Model
• Language Model
ACOUSTIC MODEL
• Acoustic modeling of speech typically refers to establishing statistical
representations for the feature vector sequences computed from the speech
waveform.
• The Hidden Markov Model (HMM) is one of the most common types of acoustic model.
• The task of this model is to represent the relationship between an audio signal
and its phonemes.
• In short, it represents sound as numbers.
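A minimal illustration of how an HMM assigns a probability to an observation sequence (the forward recursion); the two-state model and all of its numbers are invented for the example:

```python
# Minimal HMM forward algorithm: the probability of an observation
# sequence under a toy two-state model. All numbers are illustrative.

def forward(obs, init, trans, emit):
    """P(obs) by summing over hidden state paths (forward recursion)."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

init  = [0.6, 0.4]                      # P(state at t=0)
trans = [[0.7, 0.3], [0.4, 0.6]]        # P(next state | state)
emit  = [{"lo": 0.8, "hi": 0.2},        # P(observation | state)
         {"lo": 0.3, "hi": 0.7}]
p = forward(["lo", "hi"], init, trans, emit)
print(round(p, 4))  # 0.228
```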
PRONUNCIATION MODEL
• These models guide the system in handling variation and predicting the right
spoken word.
• They help us account for the pronunciation variation across accents that we
face in daily life.
• Example: “technology” is pronounced differently in British and American accents.
• A pronunciation model describes how sequences of fundamental speech
units (such as phones or phonetic features) are combined to represent larger
speech units such as words or phrases.
LANGUAGE MODEL
• It provides the context needed to distinguish between words and phrases that
sound similar.
• It models the probabilities of word sequences, helping predict the words
behind a given sound waveform.
• It is a probability distribution over sequences of words that makes the
prediction algorithm better.
• It constrains the search to a limited set of likely word sequences.
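A toy bigram language model shows how context distinguishes acoustically similar candidates; the corpus and the resulting probabilities below are invented:

```python
# Toy bigram language model: P(sentence) as a product of
# P(word | previous word) estimated from counts.
from collections import Counter

def bigram_probs(sentences):
    """Estimate P(w2 | w1) from a list of tokenized sentences."""
    pair_counts, word_counts = Counter(), Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            pair_counts[(w1, w2)] += 1
            word_counts[w1] += 1
    return {pair: c / word_counts[pair[0]] for pair, c in pair_counts.items()}

def score(sentence, probs):
    """Probability of a sentence as a product of bigram probabilities."""
    p = 1.0
    for w1, w2 in zip(sentence, sentence[1:]):
        p *= probs.get((w1, w2), 1e-6)  # tiny floor for unseen bigrams
    return p

corpus = [["please", "recognize", "speech"],
          ["please", "recognize", "speech"],
          ["wreck", "a", "nice", "beach"]]
probs = bigram_probs(corpus)
# Acoustically similar candidates get very different LM scores:
assert score(["recognize", "speech"], probs) > \
       score(["wreck", "a", "nice", "speech"], probs)
```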
WHY DO WE NEED BETTER?
• The traditional method is divided into three different models that are
interdependent, which makes it hard for the community to experiment with.
• The newly suggested approach subsumes all three models into one complete
model that does the interlinking on its own.
• It is being implemented to avoid the human mistakes made when engineering
the components separately, and to use machine learning capabilities to build
better systems.
SEQUENCE TO SEQUENCE MODEL
• A sequence to sequence model is about training models to convert sequences
from one domain into sequences in another domain. Whenever you need to
generate text from audio, you can use it.
• For example, given a sequence of frames from a video, we can analyze the
action being performed: by dividing the video into multiple frames and
looking at the changes between them, the model identifies the actions
performed.
• The same idea can be used to translate from one language to another, since
these are all kinds of sequences.
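The output side of a sequence to sequence model can be sketched as a greedy decode loop: emit one symbol per step, feed each prediction back in, stop at an end-of-sequence token. `decoder_step` here is a hypothetical stand-in for a trained network and just follows a lookup table:

```python
# Sketch of seq2seq greedy decoding; `decoder_step` is a toy stand-in
# for a trained decoder network.

EOS = "<eos>"

def decoder_step(encoded, prev_token):
    """Toy 'network': maps the previous token to the next via a table."""
    table = {"<sos>": "h", "h": "i", "i": EOS}
    return table[prev_token]

def greedy_decode(encoded, max_steps=10):
    out, prev = [], "<sos>"
    for _ in range(max_steps):
        nxt = decoder_step(encoded, prev)
        if nxt == EOS:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

print(greedy_decode(encoded=None))  # hi
```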
LISTEN, ATTEND AND SPELL MODEL
• It’s an attention based sequence to sequence model that subsumes all the traditional
models into one model.
• Attention based models rest on two key ideas:
1. Passing information from the encoder to the decoder.
2. Passing a learned context that tells the model where in the speech to pay
attention.
• It includes an Encoder, which is analogous to the traditional acoustic model;
• an Attender, which acts as an alignment model;
• and a Decoder, which is analogous to the language model in a conventional
system.
• These changes were made to minimize the word error rate, via Minimum Word
Error Rate (MWER) training.
• Listener (Encoder):
Like an acoustic model, it takes the input features and maps them to a higher level
feature representation.
• Attender:
• Determines which encoder features should be attended to in order to predict the
next output symbol.
• Decoder:
• Takes the attention context generated by the attender, as well as the embedding of
the previous prediction, and produces a probability distribution over output symbols.
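The attender’s job can be illustrated with plain dot-product attention: score each encoder feature against the decoder state, softmax the scores into weights, and take a weighted sum as the context. The feature values below are invented:

```python
# Dot-product attention as an Attender might compute it:
# weights over encoder features, then a weighted-sum context vector.
import math

def attend(encoder_feats, decoder_state):
    """Return (weights, context) for one decoder step."""
    scores = [sum(h_d * s_d for h_d, s_d in zip(h, decoder_state))
              for h in encoder_feats]
    exps = [math.exp(x) for x in scores]
    weights = [e / sum(exps) for e in exps]          # softmax
    context = [sum(w * h[d] for w, h in zip(weights, encoder_feats))
               for d in range(len(decoder_state))]
    return weights, context

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 encoder time steps
weights, context = attend(feats, decoder_state=[2.0, 0.0])
assert abs(sum(weights) - 1.0) < 1e-9   # weights form a distribution
assert weights[0] > weights[1]          # state attends to matching frames
```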
IMPROVEMENTS IN THE MODEL
• Structural Improvements
1. Word piece model
2. Multi headed attention
• Optimization Improvements
1. Minimum error rate training
2. Scheduled Sampling
• Word Piece Model:
• The traditional use of graphemes as output units in sequence to sequence models
helps fold the AM, PM, and LM into one neural network.
• Using separate phonemes, which require dedicated PM and LM components, was
not found to improve accuracy over graphemes.
• The much lower perplexity of a word piece model allows for a better decoder
language model.
• Modeling longer units (word pieces) improves the effective memory of the decoder
(its LSTM short-term memory), and allows the model to potentially memorize
pronunciations of frequently occurring words.
• Words are partitioned deterministically and independently of context, using a
greedy algorithm, so a word always maps to the same pieces rather than a set of
possibilities open to false prediction.
• Longer units require fewer decoding steps, which speeds up inference in these
models significantly.
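Greedy, context-independent word-piece segmentation can be sketched as longest-match lookup; the vocabulary and the `##` continuation-marker convention are illustrative choices, not from the slides:

```python
# Greedy longest-match word-piece segmentation over a toy vocabulary.

def wordpiece(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try longest match first
            cand = word[i:j] if i == 0 else "##" + word[i:j]
            if cand in vocab:
                pieces.append(cand)
                i = j
                break
        else:
            return ["<unk>"]                    # no piece matched
    return pieces

vocab = {"play", "##ing", "##ed", "talk", "##s"}
print(wordpiece("playing", vocab))  # ['play', '##ing']
```

Because the match is deterministic, "playing" always becomes the same pieces, and each piece is a longer decoding unit than a single grapheme.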
• Multi Headed Attention:
• M.H.A. extends the conventional attention mechanism to have multiple heads,
where each head can generate a different attention distribution.
• This allows different heads to attend to the encoder outputs differently, each
playing its own individual role.
• In a single headed architecture, the encoder alone must provide the model with
clear signals about the utterance so that the decoder can pick up the
information with attention.
OPTIMIZATION IMPROVEMENTS
• Minimum Word Error Rate Training:
• The loss function normally optimized in attention based systems is a sequence level loss,
not the word error rate itself.
• The strategy is instead to minimize the expected number of word errors.
• Scheduled Sampling:
• Feeding the ground truth label as the previous prediction (so called teacher forcing) helps the
decoder learn quickly at the beginning, but introduces a mismatch between training
and inference.
• Scheduled sampling, on the other hand, samples from the probability distribution of
the previous prediction and then feeds the resulting token as the previous token when
predicting the next label.
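Scheduled sampling reduces to a per-step coin flip between the ground-truth token and the model’s own previous prediction; the token names and sampling probability below are placeholders:

```python
# Scheduled sampling: with probability p_sample, feed the model's own
# previous prediction instead of the ground-truth token during training.
import random

def next_input(ground_truth_prev, model_prev, p_sample):
    """Choose the decoder's previous-token input for this step."""
    return model_prev if random.random() < p_sample else ground_truth_prev

random.seed(0)
inputs = [next_input("truth", "model", p_sample=0.4) for _ in range(1000)]
frac = inputs.count("model") / len(inputs)
assert 0.3 < frac < 0.5  # roughly p_sample of steps use the model's token
```

In practice p_sample is typically ramped up over training, so early training behaves like teacher forcing and later training resembles inference.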
CONCLUSION
• Structural Improvements:
• The word piece model performs slightly better than graphemes, giving roughly a
2% relative improvement in W.E.R. (word error rate). The remaining gains come
on top of the M.H.A. and W.P.M. model.
• Optimization Improvements:
• Including synchronous training on top of the W.P.M. + M.H.A. model provides a 3.8%
improvement. Overall, the optimizations give around a 22.5% relative improvement,
moving the W.E.R. from 8.0% to 6.2%.
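The quoted figures are mutually consistent: relative W.E.R. improvement is (old − new) / old.

```python
# Check: 8.0% -> 6.2% absolute W.E.R. is a 22.5% relative improvement.
old_wer, new_wer = 8.0, 6.2
relative = (old_wer - new_wer) / old_wer
print(f"{relative:.1%}")  # 22.5%
```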
FINAL OUTPUT
• The sequence to sequence model gives an 11% relative improvement in W.E.R., but
the unidirectional L.A.S. system has a limitation:
• the entire utterance must be seen by the encoder before any labels can be
decoded.
• To avoid having to look at the whole utterance at once, we need online
algorithms, i.e. streaming attention based models.

More Related Content

What's hot

Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
Richie
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition final
Archit Vora
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminar
Diptimaya Sarangi
 
Voice recognition system
Voice recognition systemVoice recognition system
Voice recognition system
avinash raibole
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in android
Anshuli Mittal
 
Deep Learning in practice : Speech recognition and beyond - Meetup
Deep Learning in practice : Speech recognition and beyond - MeetupDeep Learning in practice : Speech recognition and beyond - Meetup
Deep Learning in practice : Speech recognition and beyond - Meetup
LINAGORA
 
Automatic Speech Recognion
Automatic Speech RecognionAutomatic Speech Recognion
Automatic Speech Recognion
International Islamic University
 
Speech Recognition in Artificail Inteligence
Speech Recognition in Artificail InteligenceSpeech Recognition in Artificail Inteligence
Speech Recognition in Artificail Inteligence
Ilhaan Marwat
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
anshu shrivastava
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition Technology
Seminar Links
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
Birudugadda Pranathi
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice Recognition
Amrita More
 
Speaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajanSpeaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajan
Abhishek Mahajan
 
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLABA GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
sipij
 
Speech recognition final presentation
Speech recognition final presentationSpeech recognition final presentation
Speech recognition final presentation
himanshubhatti
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
Manthan Gandhi
 
Deep Learning for Speech Recognition - Vikrant Singh Tomar
Deep Learning for Speech Recognition - Vikrant Singh TomarDeep Learning for Speech Recognition - Vikrant Singh Tomar
Deep Learning for Speech Recognition - Vikrant Singh Tomar
WithTheBest
 
Speech Recognition
Speech Recognition Speech Recognition
Speech Recognition
Goa App
 
High level speaker specific features modeling in automatic speaker recognitio...
High level speaker specific features modeling in automatic speaker recognitio...High level speaker specific features modeling in automatic speaker recognitio...
High level speaker specific features modeling in automatic speaker recognitio...
IJECEIAES
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
boddu syamprasad
 

What's hot (20)

Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition final
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminar
 
Voice recognition system
Voice recognition systemVoice recognition system
Voice recognition system
 
Speaker recognition in android
Speaker recognition in androidSpeaker recognition in android
Speaker recognition in android
 
Deep Learning in practice : Speech recognition and beyond - Meetup
Deep Learning in practice : Speech recognition and beyond - MeetupDeep Learning in practice : Speech recognition and beyond - Meetup
Deep Learning in practice : Speech recognition and beyond - Meetup
 
Automatic Speech Recognion
Automatic Speech RecognionAutomatic Speech Recognion
Automatic Speech Recognion
 
Speech Recognition in Artificail Inteligence
Speech Recognition in Artificail InteligenceSpeech Recognition in Artificail Inteligence
Speech Recognition in Artificail Inteligence
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition Technology
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice Recognition
 
Speaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajanSpeaker recognition system by abhishek mahajan
Speaker recognition system by abhishek mahajan
 
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLABA GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
A GAUSSIAN MIXTURE MODEL BASED SPEECH RECOGNITION SYSTEM USING MATLAB
 
Speech recognition final presentation
Speech recognition final presentationSpeech recognition final presentation
Speech recognition final presentation
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 
Deep Learning for Speech Recognition - Vikrant Singh Tomar
Deep Learning for Speech Recognition - Vikrant Singh TomarDeep Learning for Speech Recognition - Vikrant Singh Tomar
Deep Learning for Speech Recognition - Vikrant Singh Tomar
 
Speech Recognition
Speech Recognition Speech Recognition
Speech Recognition
 
High level speaker specific features modeling in automatic speaker recognitio...
High level speaker specific features modeling in automatic speaker recognitio...High level speaker specific features modeling in automatic speaker recognitio...
High level speaker specific features modeling in automatic speaker recognitio...
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 

Similar to Sequence to sequence model speech recognition

NLP,expert,robotics.pptx
NLP,expert,robotics.pptxNLP,expert,robotics.pptx
NLP,expert,robotics.pptx
AmanBadesra1
 
Kc3517481754
Kc3517481754Kc3517481754
Kc3517481754
IJERA Editor
 
LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
MedBelatrach
 
Assign
AssignAssign
Speechrecognition 100423091251-phpapp01
Speechrecognition 100423091251-phpapp01Speechrecognition 100423091251-phpapp01
Speechrecognition 100423091251-phpapp01
girishjoshi1234
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
Chamani Shiranthika
 
Unit 5f.pptx
Unit 5f.pptxUnit 5f.pptx
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
REMEGIUSPRAVEENSAHAY
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
IRJET Journal
 
NLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docxNLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docx
KevinSims18
 
Dy36749754
Dy36749754Dy36749754
Dy36749754
IJERA Editor
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A Survey
NUPUR YADAV
 
Speech recognition An overview
Speech recognition An overviewSpeech recognition An overview
Speech recognition An overview
sajanazoya
 
On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...
rahulmonikasharma
 
sentiment analysis
sentiment analysis sentiment analysis
sentiment analysis
ShivangiYadav42
 
IRJET- Vocal Code
IRJET- Vocal CodeIRJET- Vocal Code
IRJET- Vocal Code
IRJET Journal
 
AI_attachment.pptx prepared for all students
AI_attachment.pptx prepared for all  studentsAI_attachment.pptx prepared for all  students
AI_attachment.pptx prepared for all students
talldesalegn
 
Ijetcas14 390
Ijetcas14 390Ijetcas14 390
Ijetcas14 390
Iasir Journals
 
Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech RecognitionTeaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
Zachary S. Brown
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
socarem879
 

Similar to Sequence to sequence model speech recognition (20)

NLP,expert,robotics.pptx
NLP,expert,robotics.pptxNLP,expert,robotics.pptx
NLP,expert,robotics.pptx
 
Kc3517481754
Kc3517481754Kc3517481754
Kc3517481754
 
LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
 
Assign
AssignAssign
Assign
 
Speechrecognition 100423091251-phpapp01
Speechrecognition 100423091251-phpapp01Speechrecognition 100423091251-phpapp01
Speechrecognition 100423091251-phpapp01
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
Unit 5f.pptx
Unit 5f.pptxUnit 5f.pptx
Unit 5f.pptx
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
NLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docxNLP Techniques for Speech Recognition.docx
NLP Techniques for Speech Recognition.docx
 
Dy36749754
Dy36749754Dy36749754
Dy36749754
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A Survey
 
Speech recognition An overview
Speech recognition An overviewSpeech recognition An overview
Speech recognition An overview
 
On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...
 
sentiment analysis
sentiment analysis sentiment analysis
sentiment analysis
 
IRJET- Vocal Code
IRJET- Vocal CodeIRJET- Vocal Code
IRJET- Vocal Code
 
AI_attachment.pptx prepared for all students
AI_attachment.pptx prepared for all  studentsAI_attachment.pptx prepared for all  students
AI_attachment.pptx prepared for all students
 
Ijetcas14 390
Ijetcas14 390Ijetcas14 390
Ijetcas14 390
 
Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech RecognitionTeaching Machines to Listen: An Introduction to Automatic Speech Recognition
Teaching Machines to Listen: An Introduction to Automatic Speech Recognition
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 

Recently uploaded

Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 

Recently uploaded (20)

Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 

Sequence to sequence model speech recognition

  • 1. SEQUENCE TO SEQUENCE MODEL IN SPEECH RECOGNITION ATTENTION BASED MODEL By: Aditya Kumar Khare B.Tech (Computer Science) 1402910012
  • 2. INTRODUCTION  Speech recognition technology has recently reached a higher level of performance and robustness, allowing it to communicate to another user by talking .  Speech Recognization is process of decoding acoustic speech signal captured by microphone or telephone ,to a set of words.  And with the help of these it will recognize whole speech is recognized word by word .
  • 3. TYPES OF SPEECH RECOGNITION  SSpeaker independent models recognize the speech patterns of • large group of people.  SSpeaker dependent models recognize speech patterns from only one person. Both models use mathematical and statistical formulas to yield the best work match for speech. A third variation of speaker models is now emerging, called speaker adaptive.  S peaker adaptive systems usually begin with a speaker independent model and adjust these models more closely to each individual during a brief training period.
  • 4. WHY DO WE NEED TO IMPROVE SPEECH RECOGNITION • Speech is the most natural form of communication and allows us to build an interface that is much more intuitive than a passive one. • It makes the interaction between us and our devices more lively and effective. • We are always targeting productivity, and better recognition improves our performance in daily tasks many times over. • It allows the next billion people, who are still unable to interact with these devices or understand how they work, to adopt the technology.
  • 5. HOW TRADITIONAL SPEECH RECOGNITION WORKS (ASR) • Automatic speech recognition (ASR) is the traditional system for recognizing speech. • It works in five basic steps. • Step 1: User Input • The system captures the user's voice as an analog acoustic signal. • Step 2: Digitization • The analog acoustic signal is digitized.
  • 6. • Step 3: Phonetic Breakdown • The signal is broken into phonemes. • Phonemes: perceptually distinct units of sound in a given language that are different from one another. Example: a group of different sounds perceived to have the same function, such as /k/ in cat, kit, scat, and skit. • Step 4: Statistical Modeling • Phonemes are mapped to their phonetic representations using a statistical model. • Step 5: Matching • According to the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., words plus confidence scores). • Grammar: the set of words or phrases used to constrain the range of input or output in the voice application.
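Steps 3 to 5 can be sketched as a toy lookup: given a phoneme sequence, score each dictionary entry and return an n-best list of words with confidence scores. The lexicon and scoring rule here are invented for illustration, not a real ASR dictionary.

```python
# Hypothetical pronunciation dictionary: word -> phoneme sequence
LEXICON = {
    "cat":  ["k", "ae", "t"],
    "kit":  ["k", "ih", "t"],
    "scat": ["s", "k", "ae", "t"],
}

def n_best(observed, n=2):
    """Score each word by the fraction of matching phoneme positions."""
    scored = []
    for word, phones in LEXICON.items():
        matches = sum(a == b for a, b in zip(observed, phones))
        confidence = matches / max(len(observed), len(phones))
        scored.append((word, confidence))
    # Return the n highest-confidence candidates (the "n-best list")
    scored.sort(key=lambda wc: wc[1], reverse=True)
    return scored[:n]

print(n_best(["k", "ae", "t"]))  # "cat" ranks first with confidence 1.0
```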
  • 7. AUTOMATIC SPEECH RECOGNITION • Acoustic Model • Pronunciation Model • Language Model
  • 8. ACOUSTIC MODEL • Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature-vector sequences computed from the speech waveform. • The hidden Markov model (HMM) is one of the most common types of acoustic model. • The task of this model is to represent the relationship between an audio signal and its phonemes. • In short, it represents sound as numbers.
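The core HMM computation behind a classic acoustic model can be illustrated with the forward algorithm, which sums over all state paths to score an observation sequence. The two states and all probabilities below are toy values, not trained parameters.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Probability of an observation sequence under the HMM (forward pass)."""
    # Initialize with the first observation
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recurse: sum over all predecessor states at each step
    for o in obs[1:]:
        alpha = {s: sum(alpha[prev] * trans_p[prev][s] for prev in states)
                    * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Toy two-state model: "sil" (silence) vs "speech", with a binary
# energy observation ("low"/"high") standing in for real feature vectors.
states = ["sil", "speech"]
start_p = {"sil": 0.6, "speech": 0.4}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.4, "speech": 0.6}}
emit_p = {"sil": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.2, "high": 0.8}}

p = forward(["low", "high"], states, start_p, trans_p, emit_p)
```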
  • 9. PRONUNCIATION MODEL • These models guide the detection of pronunciation variants and the prediction of the right spoken word. • They take care of the pronunciation variation across accents that we face in daily life. • Example: "technology" is pronounced differently in British and American accents. • A pronunciation model describes how a sequence (or multiple sequences) of fundamental speech units, such as phones or phonetic features, represents larger speech units such as words or phrases.
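A pronunciation model can be pictured as a lexicon with several phoneme-sequence variants per word, so that different accents map back to the same word. The phoneme symbols below are an invented illustration, not a standard phone set.

```python
# Hypothetical multi-variant lexicon: word -> list of phoneme sequences
PRONUNCIATIONS = {
    "technology": [
        ["t", "e", "k", "n", "o", "l", "@", "j", "i"],   # one accent
        ["t", "@", "k", "n", "a", "l", "@", "j", "i"],   # another accent
    ],
}

def words_for(phones):
    """Return every word that lists this phoneme sequence as a variant."""
    return [w for w, variants in PRONUNCIATIONS.items() if phones in variants]

# Both accent variants resolve to the same word
print(words_for(["t", "e", "k", "n", "o", "l", "@", "j", "i"]))
```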
  • 10. LANGUAGE MODEL • It provides the context needed to distinguish between words and phrases that sound similar. • It models the probabilities of word sequences, so that the most likely transcription can be predicted from the given sound waveform. • It is a probability distribution over sequences of words that makes the prediction algorithm better. • It constrains the search to a limited region of the hypothesis space when decoding the waveform.
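The "probability distribution over word sequences" idea can be sketched with a bigram model, where the probability of a sequence is the product of P(word | previous word) terms estimated from counts. The corpus here is a toy example.

```python
from collections import Counter

# Toy training corpus
corpus = "recognize speech recognize speech well recognize words".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """P(word | prev) estimated from raw counts (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sequence_prob(words):
    """Chain-rule product of bigram probabilities over the sequence."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# "recognize" is followed by "speech" 2 times out of 3 occurrences
print(sequence_prob(["recognize", "speech"]))
```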
  • 11. WHY DO WE NEED BETTER? • The traditional method is divided into three different models that are interdependent, which makes it hard for the community to experiment with. • The newly suggested approaches subsume all three models into one complete model that does the interlinking on its own. • This is done to avoid the errors of hand-designed components and to use machine learning capabilities to build better systems.
  • 12. SEQUENCE TO SEQUENCE MODEL • A sequence to sequence model is about training models to convert sequences from one domain into sequences in another domain; whenever you need to generate text from audio, you can use one. • As an analogy, given a sequence of frames from a video, we can analyze the action being performed: dividing the video into multiple frames and looking at the changes between them leads to identifying the action. • The same idea can be used to translate from one language to another, since these are all kinds of sequences.
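A minimal numpy sketch of the encoder-decoder data flow, assuming random (untrained) weights: the encoder folds acoustic frames into hidden states, and the decoder emits a probability distribution over output symbols. This illustrates the shapes and flow only, not a trained recognizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, vocab = 4, 8, 10          # toy dimensions

W_enc = rng.normal(size=(d_in + d_hid, d_hid))
W_out = rng.normal(size=(d_hid, vocab))

def encode(frames):
    """frames: (T, d_in) acoustic features -> (T, d_hid) hidden states."""
    h = np.zeros(d_hid)
    states = []
    for x in frames:
        # Simple recurrent update: mix current frame with previous state
        h = np.tanh(np.concatenate([x, h]) @ W_enc)
        states.append(h)
    return np.stack(states)

def decode_step(h_last):
    """One decoder step: softmax distribution over the output vocabulary."""
    logits = h_last @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

frames = rng.normal(size=(5, d_in))    # 5 fake acoustic frames
states = encode(frames)
probs = decode_step(states[-1])
```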
  • 13. LISTEN, ATTEND AND SPELL MODEL • It is an attention-based sequence to sequence model that subsumes all the traditional models into one. • Attention-based models rest on two key ideas: 1. Passing information from the encoder to the decoder 2. Passing the learned context of where in the speech to pay attention • It includes an Encoder, which is analogous to the traditional acoustic model. • An Attender, which acts as an alignment model.
  • 14. • And a Decoder, which is analogous to the language model in a conventional system. • These changes were made to minimize the minimum word error rate (MWER). • Listener (Encoder): like the acoustic model, it takes the input features and maps them to a higher-level feature representation. • Attender: determines which encoder features should be attended to in order to predict the next output symbol. • Decoder: takes the attention context generated by the attender, as well as the embedding of the previous prediction, to produce a probability distribution over output symbols.
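The attender's role can be sketched as dot-product attention: score each encoder state against the current decoder state, normalize the scores into an attention distribution, and take the weighted sum as the context vector. The vectors here are toy values, and dot-product scoring is one common choice rather than the specific scorer used in LAS.

```python
import math

def attend(decoder_state, encoder_states):
    """Dot-product attention: returns (context vector, attention weights)."""
    # Score each encoder state against the decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    # Softmax the scores into a distribution over encoder positions
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Context = attention-weighted sum of encoder states
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 encoder states
dec = [2.0, 0.0]                             # current decoder state
context, weights = attend(dec, enc)
```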
  • 15. IMPROVEMENTS IN THE MODEL • Structural Improvements 1. Word-piece model 2. Multi-headed attention • Optimization Improvements 1. Minimum word error rate training 2. Scheduled sampling
  • 16. • Word Piece Model: • The traditional use of graphemes as output units in sequence to sequence models helps fold the acoustic, pronunciation, and language models (AM, PM, LM) into one neural network. • Using separate phonemes, which require explicit PM and LM components, was not found to improve model accuracy over graphemes. • The much lower perplexity of a word-piece model allows for a better decoder language model. • Modeling longer units (word pieces) improves the effectiveness of the decoder's short-term memory and allows the model to potentially memorize pronunciations of frequently occurring words. • Words are partitioned deterministically and independently of context, using a greedy algorithm, so the segmentation does not vary due to false predictions. • Longer units require fewer decoding steps, which speeds up inference in these models significantly.
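The deterministic, greedy segmentation mentioned above can be sketched as longest-match-first lookup against a word-piece vocabulary. The vocabulary below is invented for illustration and is far smaller than a real one.

```python
# Hypothetical word-piece vocabulary (multi-character pieces plus letters)
VOCAB = {"spe", "ech", "re", "cog", "ni", "tion", "s", "p", "e", "c", "h",
         "r", "o", "g", "n", "i", "t"}

def wordpieces(word):
    """Greedy longest-match segmentation of one word into pieces."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError("unsegmentable: " + word[i:])
    return pieces

print(wordpieces("speech"))       # ['spe', 'ech']
print(wordpieces("recognition"))  # ['re', 'cog', 'ni', 'tion']
```

Because the segmentation depends only on the vocabulary and the word itself, the same word always yields the same pieces, which is the deterministic property the slide refers to.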
  • 17. • Multi-Headed Attention: • MHA extends the conventional attention mechanism to have multiple heads, where each head can generate a different attention distribution. • This allows different heads to attend to the encoder outputs differently, each playing its own individual role. • In a single-headed architecture, the encoder must provide the model clear signals about the utterance so that the decoder can pick up the information with attention.
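A toy numpy sketch of multi-headed attention, assuming a simple dot-product formulation: the model dimension is split across heads, each head computes its own attention distribution over the encoder outputs, and the per-head contexts are concatenated. Shapes and weights are invented.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(query, keys, values, n_heads):
    """query: (d,), keys/values: (T, d) -> concatenated context (d,)."""
    d = query.shape[-1]
    assert d % n_heads == 0
    hd = d // n_heads                     # per-head dimension
    contexts = []
    for h in range(n_heads):
        # Each head sees only its own slice of the model dimension
        q = query[h * hd:(h + 1) * hd]
        k = keys[:, h * hd:(h + 1) * hd]
        v = values[:, h * hd:(h + 1) * hd]
        # Independent attention distribution per head
        weights = softmax(k @ q / np.sqrt(hd))
        contexts.append(weights @ v)
    return np.concatenate(contexts)

rng = np.random.default_rng(1)
q = rng.normal(size=8)                    # decoder-side query
kv = rng.normal(size=(5, 8))              # 5 encoder outputs
ctx = multi_head_attention(q, kv, kv, n_heads=2)
```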
  • 18. OPTIMIZATION IMPROVEMENTS • Minimum Word Error Rate Training: • The loss function we optimize for the attention-based system is a sequence-level loss function, but it is not the word error rate itself. • The strategy is therefore to directly minimize the expected number of word errors. • Scheduled Sampling: • Feeding the ground-truth label as the previous prediction (so-called teacher forcing) helps the decoder learn quickly at the beginning, but introduces a mismatch between training and inference. • Scheduled sampling, on the other hand, samples from the probability distribution of the previous prediction and then feeds the resulting token as the previous token when predicting the next label.
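Scheduled sampling reduces to a per-step coin flip between the ground-truth token and the model's own sampled prediction; a minimal sketch, where the sampling probability and the model's sample are stand-ins for the real schedule and decoder output:

```python
import random

def next_input(ground_truth, model_sample, sampling_prob, rng=random):
    """Choose which token the decoder sees as its 'previous prediction'."""
    if rng.random() < sampling_prob:
        return model_sample      # use the model's own sampled prediction
    return ground_truth          # teacher forcing: use the true label

random.seed(0)
# sampling_prob=0.0 is pure teacher forcing; 1.0 is pure sampling.
# In practice the probability is ramped up on a schedule during training.
teacher = [next_input("the", "a", 0.0) for _ in range(5)]
sampled = [next_input("the", "a", 1.0) for _ in range(5)]
```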
  • 19. CONCLUSION • Structural improvements: • The word-piece model performs slightly better than graphemes, giving roughly a 2% relative improvement in word error rate (WER); the remaining gains come on top of the MHA and WPM model. • Optimization improvements: • Adding synchronous training on top of the WPM + MHA model provides a 3.8% improvement. Overall, the optimizations yield around a 22.5% relative improvement, moving the WER from 8.0% to 6.2%.
  • 20. FINAL OUTPUT • The sequence to sequence model gives an 11% relative improvement in WER, but a unidirectional LAS system has a limitation. • The entire utterance must be seen by the encoder before any labels can be decoded. • So, to avoid having to look at the whole utterance at once, we need online algorithms that enable a streaming attention-based model.