Automatic Speech Recognition:
Classical Methods to End-to-End Methods
MLILAB
Wonjun Jeong
Table of Contents
1. Background
a. Automatic Speech Recognition (ASR)
2. Classical Methods
a. Hidden Markov Model (HMM)
b. HMM-GMM
c. HMM-DNN
3. End-to-End Methods
a. Connectionist Temporal Classification (CTC)
b. Seq2Seq & Attention
c. SpecAugment
4. Conclusions
Background
● Automatic Speech Recognition (ASR)
What is Automatic Speech Recognition (ASR)?
[Figure: Speech → Signal Analysis → ASR → Decoded Text (Transcription)]
Signal Analysis (Feature Extraction)
● Convert the analog signal to a digital signal, then to a spectrogram
○ A/D conversion, then the Fast Fourier Transform (FFT) per frame
○ Frames of ~10 ms each, passed through a mel-scale filter bank
● Result: one frame vector (acoustic feature) per frame = input to the recognizer
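The framing, FFT and mel filter bank steps can be sketched with NumPy as follows. This is a minimal illustration, not the deck's own pipeline: the 25 ms window with a 10 ms hop, the Hamming window, the 512-point FFT, the 40 mel bands and the function names are all assumptions chosen because they are common defaults.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):      # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):     # falling edge of the triangle
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def log_mel_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
                     n_fft=512, n_mels=40):
    # Assumed settings: 25 ms Hamming-windowed frames taken every 10 ms.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # power spectrum per frame
    fb = mel_filterbank(n_mels, n_fft, sample_rate)
    return np.log(power @ fb.T + 1e-10)   # one log-mel vector per frame
```

Under these assumed settings, one second of 16 kHz audio gives a feature matrix of shape (98, 40): 98 frame vectors, each with 40 log-mel energies.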
Phoneme, Word and Sentence
● Phoneme
○ The unit of sound that distinguishes one word from another in a particular language.
● Word
○ Lexicon(Phoneme + Phoneme + …) = Word
● Sentence
○ Word + Word + … = Sentence
○ Language Model := P(Sentence)
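One simple way to assign P(Sentence) is a bigram model, which multiplies the probability of each word given the previous word. The sketch below is only an illustration: the add-one smoothing, the `<s>` and `</s>` boundary tokens and the function names are assumptions, not something the slides specify.

```python
import math
from collections import Counter

def train_bigram(sentences):
    # Count each context word and each (previous word, word) pair.
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def sentence_log_prob(sentence, unigrams, bigrams, vocab_size):
    # log P(sentence) = sum of log P(word | previous word), with add-one smoothing.
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens[:-1], tokens[1:])
    )

# Toy corpus, made up for illustration.
unigrams, bigrams = train_bigram(["the cat sat", "the dog sat"])
vocab_size = 5  # the, cat, dog, sat, </s>
plausible = sentence_log_prob("the cat sat", unigrams, bigrams, vocab_size)
scrambled = sentence_log_prob("sat cat the", unigrams, bigrams, vocab_size)
```

With this toy corpus the word order seen in training scores higher than the scrambled order, which is the behaviour a recognizer relies on when it ranks candidate transcriptions.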
Classical Methods
● Hidden Markov Model (HMM)
● HMM-GMM
● HMM-DNN
Pipeline of Classical ASR
[Figure: classical ASR pipeline. Speech → Signal Analysis → Search Space → Decoded Text (Transcription). The Search Space combines the Acoustic Model (AM), the Pronunciation Lexicon (PL) and the Language Model (LM); Training Data feeds the models.]
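Fundamental Equation of Statistical Speech Recognition
● If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by
$$W^* = \arg\max_W P(W \mid X)$$
● Applying Bayes' Theorem:
$$P(W \mid X) = \frac{P(X \mid W)\, P(W)}{P(X)} \propto P(X \mid W)\, P(W)$$
$$W^* = \arg\max_W \underbrace{P(X \mid W)}_{\text{acoustic model (HMM)}} \; \underbrace{P(W)}_{\text{language model}}$$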
Recall: Hidden Markov Model (HMM)
● Once an HMM has been trained,
○ we can compute the likelihood of a given observation sequence.
○ we can find the most probable hidden state sequence for a given observation sequence (decoding).
● Model the phoneme sequence as an HMM
○ The probability of a state depends only on the previous state
○ An output observation depends only on the state that produced it
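The likelihood of an observation sequence can be computed with the forward algorithm, which uses exactly the two independence assumptions of an HMM. A minimal sketch for a discrete-output HMM follows; the two-state parameters `pi`, `A` and `B` are made-up toy values, and the function name is hypothetical.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs) for a discrete HMM.

    pi[i]   : initial probability of state i
    A[i, j] : transition probability from state i to state j
    B[i, k] : probability that state i emits symbol k
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # sum over previous states, then emit
    return alpha.sum()

# Toy parameters, made up for illustration.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
likelihood = forward_likelihood(pi, A, B, [0, 1, 1])
```

For long sequences the product of probabilities underflows, so practical implementations work in the log domain or rescale `alpha` at every step.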
Calculation of P(X|W)
● Assume the frame sequence $x_1, \dots, x_T$ corresponds to phoneme /s/; the conditional probability that we observe the sequence is $P(x_1, \dots, x_T \mid /s/)$
● An HMM is employed to calculate it.
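Calculation of P(X|W)
● Consider a simple case where the length of the input sequence is just one ($T = 1$) and the dimensionality of x is one ($d = 1$), so we only need $P(x \mid /s/)$
● A Gaussian distribution function could be employed for this:
$$P(x \mid /s/) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
● Given a set of training samples $\{x_1, \dots, x_N\}$, we can estimate
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$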
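The one-dimensional Gaussian case can be worked through in a few lines: estimate the mean and variance by maximum likelihood, then evaluate the density of a new frame. The sample values and function names below are made up for illustration.

```python
import math

def fit_gaussian(samples):
    # Maximum-likelihood estimates: sample mean and (biased) sample variance.
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    # Density of a 1-D Gaussian with mean mu and variance var.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical 1-D features observed for one phoneme.
mu, var = fit_gaussian([1.0, 2.0, 3.0, 4.0, 5.0])   # mu = 3.0, var = 2.0
density_at_mean = gaussian_pdf(3.0, mu, var)
```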
Acoustic Model: Continuous Density HMM
● In the general case, where a phone lasts more than one frame, we need to employ an HMM
● We need to define the output distribution $b_j(x) = P(x \mid S = j)$ for each state j
Acoustic Model: Continuous Density HMM
● Output distribution: M-component Gaussian Mixture Model (GMM):
$$b_j(x) = P(x \mid S = j) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(x; \mu_{jm}, \Sigma_{jm})$$
○ Individual components take responsibility for parts of the data set
○ All parameters are estimated from data by EM
● Then train the HMM with the Baum-Welch algorithm (a form of EM)
○ Parameters λ: transition probabilities $a_{kj} = P(S = j \mid S = k)$ and the Gaussian parameters $\{c_{jm}, \mu_{jm}, \Sigma_{jm}\}$ for state j
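The idea that each mixture component "takes responsibility" for part of the data is what the EM update computes. Below is a minimal one-dimensional sketch of a single EM step for a GMM. It fits one mixture in isolation rather than running full Baum-Welch over an HMM, and the function names are hypothetical.

```python
import numpy as np

def component_densities(x, weights, means, variances):
    # Weighted density of every component at every point, shape (N, M).
    x = np.asarray(x, dtype=float)[:, None]
    dens = np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    return dens * weights

def em_step(x, weights, means, variances):
    x = np.asarray(x, dtype=float)
    weighted = component_densities(x, weights, means, variances)
    # E-step: responsibility of each component for each data point.
    resp = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and variances from the responsibilities.
    nk = resp.sum(axis=0)
    new_means = (resp * x[:, None]).sum(axis=0) / nk
    new_vars = (resp * (x[:, None] - new_means) ** 2).sum(axis=0) / nk
    return nk / len(x), new_means, new_vars

def log_likelihood(x, weights, means, variances):
    # Log of the mixture density, summed over all data points.
    return float(np.log(component_densities(x, weights, means, variances).sum(axis=1)).sum())
```

Repeating `em_step` never decreases `log_likelihood`, which is the property that makes EM usable as a training loop.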
Decoding
● Given an observation sequence and an HMM, determine the most probable hidden
state sequence (Viterbi algorithm)
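A minimal sketch of Viterbi decoding for a discrete-output HMM, working in the log domain. The function name and the parameter layout (`pi`, `A`, `B` as NumPy arrays with no zero entries) are assumptions made for this illustration.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden state path and its log joint probability.

    pi[i]: initial probability, A[i, j]: transition i -> j,
    B[i, k]: probability that state i emits symbol k.
    """
    T, N = len(obs), len(pi)
    log_delta = np.log(pi) + np.log(B[:, obs[0]])
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log probability of a path ending in i, then moving to j
        scores = log_delta[:, None] + np.log(A)
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace back from the best final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(log_delta.max())
```

Compared with the forward algorithm, the only change is that the sum over previous states is replaced by a maximum, plus the back-pointers needed to recover the path.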
After Decoding...
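HMM-DNN
● Use the state posteriors from a Deep Neural Network in place of the GMM output distribution
○ The DNN posterior P(s | x) is divided by the state prior P(s) to give a scaled likelihood that the HMM can use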
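In a hybrid HMM-DNN system the network outputs state posteriors, while the HMM expects likelihoods, so the posteriors are usually divided by the state priors. A minimal sketch of that conversion follows; the function name and the array shapes are assumptions.

```python
import numpy as np

def scaled_log_likelihoods(logits, state_priors):
    """Convert DNN outputs to scaled log-likelihoods for HMM decoding.

    logits       : (n_frames, n_states) pre-softmax network outputs
    state_priors : (n_states,) prior P(s), e.g. relative state frequencies
                   counted from aligned training data
    """
    # Numerically stable log-softmax gives log P(s | x) for each frame.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_post = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Bayes' rule: log P(x | s) = log P(s | x) - log P(s) + constant.
    return log_post - np.log(state_priors)
```

The dropped term log P(x) is the same for every state at a given frame, so it does not change which state sequence the decoder prefers.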
Limitation
● AM and LM are trained separately, each with a different objective
● Too many steps
[Figure: the classical ASR pipeline, repeated: Signal Analysis, Acoustic Model (AM), Pronunciation Lexicon (PL), Language Model (LM), Search Space, Training Data, Speech → Decoded Text (Transcription)]
End-to-End Methods
● Connectionist Temporal Classification (CTC)
● Seq2Seq & Attention
● SpecAugment
Connectionist Temporal Classification (CTC)
● The spectrograms are processed by an RNN with a CTC output layer.
○ By contrast, a network trained as a frame-level classifier requires an alignment between the audio and transcription sequences.
● The network is trained directly on the text transcripts:
○ no phonetic representation
○ no pronunciation lexicon
● Plain CTC maximises the likelihood of the transcript; a modified objective can optimise the word error rate directly
Connectionist Temporal Classification (CTC)
● An objective function that allows an RNN to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences.
● The output layer contains a single unit for each of the transcription labels, plus an extra unit referred to as the 'blank'
○ e.g. the alignment (a, a, a, -, b), with '-' as the blank, corresponds to the transcription "ab"
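The mapping from a frame-level alignment to a transcription first merges repeated labels and then removes blanks. A minimal sketch, using '-' as the blank symbol; the function name is hypothetical.

```python
from itertools import groupby

BLANK = "-"

def ctc_collapse(alignment, blank=BLANK):
    # Step 1: merge runs of identical labels (a, a, a, -, b) -> (a, -, b).
    merged = [label for label, _ in groupby(alignment)]
    # Step 2: drop the blanks (a, -, b) -> (a, b).
    return [label for label in merged if label != blank]

single = ctc_collapse(["a", "a", "a", "-", "b"])   # ['a', 'b']
double = ctc_collapse(["a", "-", "a", "b"])        # ['a', 'a', 'b']
```

The second call shows why the blank is needed: without it, a genuinely doubled letter could not be distinguished from one label held over several frames.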
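Connectionist Temporal Classification (CTC)
● Given a length T input sequence x, the output vectors $y_t$ are normalised with the softmax:
$$\Pr(k, t \mid x) = \frac{\exp(y_t^k)}{\sum_{k'} \exp(y_t^{k'})}$$
○ $y_t^k$ is the k-th element of $y_t$
● A CTC alignment is a length T sequence of blank and label indices
○ $a = (a_1, \dots, a_T)$, with $\Pr(a \mid x) = \prod_{t=1}^{T} \Pr(a_t, t \mid x)$
● 'Integrating out' over the possible alignments gives the probability of a transcription
○ $\Pr(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} \Pr(a \mid x)$, where $y = \mathcal{B}(a)$ merges repeated labels and then removes blanks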
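Summing over every alignment that collapses to a given transcription is done efficiently with a forward recursion over the target interleaved with blanks. The sketch below works with plain probabilities for readability (real implementations use the log domain); the function name and the choice of index 0 for the blank are assumptions.

```python
import numpy as np

def ctc_probability(probs, target, blank=0):
    """Probability of `target` summed over all CTC alignments.

    probs  : (T, K) per-frame softmax outputs
    target : list of label indices without blanks
    """
    T = probs.shape[0]
    # Extended target: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]          # start with a blank ...
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]      # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]             # advance by one symbol
            # Skip the blank between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    # Valid alignments end on the last label or on the final blank.
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)
```

The CTC training loss is the negative logarithm of this quantity, and because the recursion is differentiable the gradient flows back to the per-frame outputs.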
Connectionist Temporal Classification (CTC)
● Limitation
○ CTC assumes that the output labels at different frames are conditionally independent given the input.
○ But natural language has dependencies between previous phonemes (words) and the current phoneme (word)
■ Context
○ So it needs a strong language model (LM)
Recall: Seq2Seq & Attention
● Seq2Seq
○ Encoder: input sequence → context vector
○ Decoder: context vector → output sequence
● Attention
○ Different parts of an input have different levels of significance.
Attention can be thought of as alignment
● The attention scores at decoding time step i indicate which acoustic features align with output token i of the target. (Listen, Attend and Spell)
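A minimal sketch of one attention step, showing how the weights act as a soft alignment over the encoder frames. Plain dot-product scoring is used here as a simplifying assumption; it is not claimed to be the exact scoring function of Listen, Attend and Spell, and the function name is hypothetical.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """One attention step.

    decoder_state  : (D,) decoder hidden state at the current output step
    encoder_states : (T, D) encoded acoustic frames
    Returns the attention weights over the T frames and the context vector.
    """
    scores = encoder_states @ decoder_state      # similarity to each frame, shape (T,)
    weights = np.exp(scores - scores.max())      # softmax, shifted for stability
    weights /= weights.sum()
    context = weights @ encoder_states           # weighted average of the frames, shape (D,)
    return weights, context
```

Stacking the weight vectors of all decoding steps gives a matrix of output steps by input frames; for speech it tends to look roughly monotonic, which is why it can be read as an alignment.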
SpecAugment (SOTA 2019)
● Data augmentation method applied to the spectrogram
○ Time warping
○ Frequency masking
○ Time masking
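Frequency and time masking can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: one mask of each kind, masked regions set to zero, made-up maximum mask widths, and time warping left out.

```python
import numpy as np

def spec_augment(spec, max_freq_mask=8, max_time_mask=20, rng=None):
    """Return a copy of `spec` (n_frames, n_mels) with one frequency mask
    and one time mask applied. Time warping is not implemented here."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_frames, n_mels = out.shape
    # Frequency masking: zero a band of f consecutive mel channels.
    f = int(rng.integers(0, min(max_freq_mask, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[:, f0:f0 + f] = 0.0
    # Time masking: zero a block of t consecutive frames.
    t = int(rng.integers(0, min(max_time_mask, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[t0:t0 + t, :] = 0.0
    return out
```

The masks are redrawn every time an utterance is seen during training, so the network never sees exactly the same input twice.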
Benchmark WER
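Word error rate (WER) is the word-level edit distance between the reference and the hypothesis (substitutions, deletions and insertions) divided by the number of reference words. A minimal sketch, assuming whitespace tokenisation and a non-empty reference; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat", "the hat sat")   # one substitution in three words
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.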
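Conclusions
● ASR has a long history. Methods that incorporate deep learning have recently delivered results, but they are not yet far ahead of HMM-based approaches
○ Most commercially deployed products are still HMM-based
● Data preprocessing (signal analysis) has a large effect on the results
○ Some models learn directly from the raw waveform, but they do not perform as well as spectrogram-based models
● Training conditions differ from real-world conditions much more than in vision tasks
○ Background noise
○ The variance within the same phoneme is very large
■ A speaker's intonation, gender, dialect, and emotional state
● As hard as it is, it may be a blue ocean (?)
○ Pytorch-Kaldi (2019)
■ waveform → signal analysis → NN training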
Thank you!