Automatic Speech Recognition techniques have evolved from classical methods like Hidden Markov Models to end-to-end neural network approaches. Classical methods train the acoustic and language models separately, while end-to-end methods directly optimize the transcription task. Connectionist Temporal Classification allows training RNNs directly on text without alignment, but requires strong language models. Seq2Seq models with attention aim to learn alignment between input and output sequences. SpecAugment improves performance through simple data augmentation of spectrograms during training. While end-to-end methods are improving, classical HMM approaches remain dominant in commercial products due to their robustness in noisy real-world conditions.
2. Table of Contents
1. Background
a. Automatic Speech Recognition (ASR)
2. Classical Methods
a. Hidden Markov Model (HMM)
b. HMM-GMM
c. HMM-DNN
3. End-to-End Methods
a. Connectionist Temporal Classification (CTC)
b. Seq2Seq & Attention
c. SpecAugment
4. Conclusions
4. What is Automatic Speech Recognition (ASR)?
[Diagram: Speech → Signal Analysis → ASR → Decoded Text (Transcription)]
5. Signal Analysis (Feature Extraction)
● Convert the analog signal to a digital representation (spectrogram)
○ A/D conversion - Fast Fourier Transform (FFT)
○ Make frames (~10 ms/frame) - mel-scale filter bank
● Frame vector (acoustic feature) = input to the ASR system
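The framing and mel filter bank steps above can be sketched in NumPy. This is a simplified illustration only: the frame length (25 ms), hop size (10 ms), and the crude triangular filter construction are assumptions, not the exact pipeline from the slides.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames (~10 ms hop)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

def log_mel_features(signal, sample_rate=16000, n_mels=40):
    """FFT magnitude per frame, then a toy mel-scale filter bank."""
    frames = frame_signal(signal, sample_rate)
    window = np.hanning(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))  # magnitude spectrum
    n_bins = spectrum.shape[1]
    # Toy triangular filters, mel-spaced (illustrative only).
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sample_rate / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_bins - 1) * hz_pts / (sample_rate / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    return np.log(spectrum @ fbank.T + 1e-8)  # one feature vector per frame

one_sec = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # fake 1 s waveform
feats = log_mel_features(one_sec)
print(feats.shape)  # (n_frames, 40)
```

Real systems typically use a library such as librosa or Kaldi for this step; the point here is only that each ~10 ms frame becomes one acoustic feature vector.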
6. Phoneme, Word and Sentence
● Phoneme
○ The unit of sound that distinguishes one word from another in a particular language.
● Word
○ Lexicon: Phoneme + Phoneme + … = Word
● Sentence
○ Word + Word + … = Sentence
○ Language Model := P(Sentence)
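A minimal sketch of the "Language Model := P(Sentence)" idea, using a bigram model over a tiny made-up corpus (the corpus and boundary markers are assumptions for illustration; real LMs use smoothing and far more data):

```python
from collections import Counter

# Hypothetical three-sentence training corpus.
corpus = [["i", "like", "speech"], ["i", "like", "sound"], ["you", "like", "speech"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]          # sentence boundary markers
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def p_sentence(words):
    """P(Sentence) ≈ product of bigram probabilities P(w_i | w_{i-1})."""
    toks = ["<s>"] + words + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        p *= bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
    return p

print(p_sentence(["i", "like", "speech"]))  # 4/9: a likely sentence in this corpus
```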
8. Pipeline of Classical ASR
[Diagram: Speech → Signal Analysis → Search Space → Decoded Text (Transcription); the Search Space combines the Acoustic Model (AM), Pronunciation Lexicon (PL), and Language Model (LM), each built from Training Data]
9. Fundamental Equation of Statistical Speech Recognition
● If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by
○ W* = argmax_W P(W|X)
● Applying Bayes’ Theorem:
○ W* = argmax_W P(X|W) P(W) / P(X) = argmax_W P(X|W) P(W)
○ P(X|W) is the acoustic model (computed with an HMM); P(W) is the language model
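The argmax above can be made concrete with a toy two-hypothesis decoder. All probabilities below are invented for illustration; the point is that the language model P(W) can overrule an acoustically stronger hypothesis.

```python
# Hypothetical scores for two competing transcriptions of the same audio.
acoustic = {"recognize speech": 0.4, "wreck a nice beach": 0.6}      # P(X|W)
language = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}  # P(W)

# W* = argmax_W P(X|W) P(W)  (P(X) is constant over W, so it drops out)
w_star = max(acoustic, key=lambda w: acoustic[w] * language[w])
print(w_star)  # "recognize speech": the LM rescues the acoustically weaker hypothesis
```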
10. Recall: Hidden Markov Model (HMM)
● Once an HMM is trained,
○ we can compute the likelihood of a given observation sequence.
○ we can find the most probable hidden state sequence for a given observation sequence (decoding).
● Assume the phoneme sequence is modelled by an HMM
○ The probability of a state depends only on the previous state
○ An output observation depends only on the state that produced it
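The first capability above, computing the likelihood of an observation sequence, is the forward algorithm. A minimal sketch with a two-state HMM and discrete observations; every number is invented for illustration:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities A[i, j]
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities B[state, symbol]
pi = np.array([0.5, 0.5])                # initial state distribution

def likelihood(obs):
    """Forward algorithm: P(observation sequence | HMM)."""
    alpha = pi * B[:, obs[0]]            # initialise with first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate and emit
    return alpha.sum()                   # sum over final states

print(round(likelihood([0, 1, 0]), 6))  # 0.099375
```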
11. Calculation of P(X|W)
● Assume W corresponds to the phoneme /s/; the conditional probability that we observe the acoustic sequence X is P(X | /s/)
● An HMM is employed to calculate it.
12. Calculation of P(X|W)
● Let’s consider a simple case where the length of the input sequence is just one (T = 1), X = (x1), and the dimensionality of x is one (d = 1)
● A Gaussian distribution function could be employed for this: P(x | /s/) = N(x; μ, σ²)
● Given a set of training samples, we can estimate μ and σ²
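Estimating μ and σ² for the single-frame, one-dimensional case is just the maximum-likelihood mean and variance. The sample values below are hypothetical stand-ins for frames of /s/:

```python
import numpy as np

# Hypothetical 1-D acoustic feature values observed for phoneme /s/.
samples = np.array([1.8, 2.1, 2.0, 1.9, 2.2])

mu = samples.mean()        # MLE of the mean
sigma2 = samples.var()     # MLE of the variance (divides by N)

def p_x_given_s(x):
    """Gaussian output density N(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(mu, sigma2)  # 2.0 0.02
```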
13. Acoustic Model: Continuous Density HMM
● For the general case where a phone lasts more than one frame, we need to employ an HMM
● We need to define the output distribution of each state
14. Acoustic Model: Continuous Density HMM
● Output distribution: M-component Gaussian Mixture Model (GMM)
○ Individual components take responsibility for parts of the data set
○ Parameters are all estimated from data by EM
● Then, train the HMM with the Baum-Welch algorithm (a kind of EM)
○ Parameters λ: transition probabilities & Gaussian parameters for each state j
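The EM idea behind GMM estimation can be sketched for a 1-D, two-component mixture. The synthetic data and initial values are assumptions; a real system fits one GMM per HMM state over high-dimensional frames.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two clusters (stand-in for frame features).
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 200)])

# Initial guesses for a 2-component GMM.
w = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):                              # EM iterations
    # E-step: responsibility of each component for each sample.
    r = w * gauss(data[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances from responsibilities.
    n = r.sum(axis=0)
    w = n / len(data)
    mu = (r * data[:, None]).sum(axis=0) / n
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / n

print(np.sort(mu))  # component means end up near the true cluster centres
```

This is exactly the "individual components take responsibility for parts of the data set" picture: the E-step computes those responsibilities, the M-step refits each component to its share.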
15. Decoding
● Given an observation sequence and an HMM, determine the most probable hidden
state sequence (Viterbi algorithm)
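A minimal Viterbi sketch for a toy two-state HMM with discrete observations (all numbers invented for illustration):

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities B[state, symbol]
pi = np.array([0.5, 0.5])                # initial state distribution

def viterbi(obs):
    """Most probable hidden state sequence for an observation sequence."""
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * A          # scores[i, j]: best path into j via i
        back.append(scores.argmax(axis=0))   # remember the best predecessor
        delta = scores.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]
    for ptr in reversed(back):               # backtrack from the best final state
        path.append(int(ptr[path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1]))  # [0, 0, 1, 1]
```

The max/argmax pair is the only difference from the forward algorithm, which sums over paths instead of taking the best one.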
18. Limitation
● AM and LM are trained separately, each with a different objective
● Too many steps
[Diagram: the classical pipeline again - Speech → Signal Analysis → Search Space (Acoustic Model (AM), Pronunciation Lexicon (PL), Language Model (LM), each built from Training Data) → Decoded Text (Transcription)]
20. Connectionist Temporal Classification (CTC)
● The spectrograms are processed by an RNN with a CTC output layer.
○ A NN trained as a frame-level classifier would require an alignment between the audio and transcription sequences; CTC removes this requirement.
● The network is trained directly on the text transcripts:
○ no phonetic representation
○ no pronunciation lexicon
● Directly optimises the word error rate
21. Connectionist Temporal Classification (CTC)
● An objective function that allows an RNN to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences.
● The output layer contains a single unit for each of the transcription labels, plus an extra unit referred to as the ‘blank’
○ e.g. the alignment (a, a, a, -, b) collapses to (a, b)
22. Connectionist Temporal Classification (CTC)
● Given a length-T input sequence x, the output vectors y_t are normalised with the softmax
○ Pr(k, t | x) = exp(y_t^k) / Σ_k' exp(y_t^k'), where y_t^k is the k-th element of y_t
● A CTC alignment a is a length-T sequence of blank and label indices; its probability is
○ Pr(a | x) = Π_t Pr(a_t, t | x)
● ‘Integrating out’ over possible alignments gives the transcription probability
○ Pr(y | x) = Σ_{a : B(a) = y} Pr(a | x), where B collapses repeats and removes blanks: y = B(a)
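The collapsing function B above is simple enough to write out directly: merge repeated labels, then drop blanks. A sketch (the blank symbol "-" follows the slide's notation):

```python
from itertools import groupby

BLANK = "-"

def ctc_collapse(alignment):
    """B(a): merge repeated labels, then remove blanks."""
    merged = [k for k, _ in groupby(alignment)]   # collapse consecutive repeats
    return [k for k in merged if k != BLANK]      # drop the blank symbol

print(ctc_collapse(["a", "a", "a", "-", "b"]))  # ['a', 'b']
print(ctc_collapse(["a", "-", "a", "b", "b"]))  # ['a', 'a', 'b']
```

Note the ordering matters: blanks are what allow genuine doubled labels ("aa") to survive, since a blank between two a's blocks the repeat-merge.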
23. Connectionist Temporal Classification (CTC)
● Limitation
○ CTC assumes independence between acoustic frames.
○ But natural language has dependencies between previous phonemes (words) and current phonemes (words)
■ Context
○ It therefore needs a strong language model (LM)
24. Recall: Seq2Seq & Attention
● Seq2Seq
○ Encoder: input sequence → context vector
○ Decoder: context vector → output sequence
● Attention
○ Different parts of an input have different levels of significance.
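A minimal dot-product attention sketch over toy encoder states. The vectors below are invented so that the query clearly "matches" one input position; real models use learned projections and higher dimensions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: score each input position against the query."""
    scores = keys @ query                  # one score per input position
    weights = softmax(scores)              # attention distribution over inputs
    return weights @ values, weights       # context vector + alignment weights

# Hypothetical encoder states: 4 input positions, dimension 3.
keys = values = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 1.0, 0]])
query = np.array([0.0, 0.0, 5.0])          # decoder state matching position 2

context, weights = attend(query, keys, values)
print(int(weights.argmax()))  # 2: attention peaks at the aligned input position
```

The `weights` vector is exactly the soft alignment the next slide refers to: at each decoding step it says which input frames matter for the current output token.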
25. Attention can be thought of as alignment
● The attention scores at decoding time step i indicate which acoustic features align with the i-th output token (Listen, Attend and Spell)
26. SpecAugment (SOTA 2019)
● A data augmentation method for spectrograms
○ Time warping
○ Frequency masking
○ Time masking
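The two masking operations can be sketched directly on a spectrogram array (time warping is omitted here; the mask-width limits are assumptions, not the paper's exact policy):

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, max_freq_mask=8, max_time_mask=10):
    """Zero out one random frequency band and one random time band."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = rng.integers(0, max_freq_mask + 1)       # width of the frequency mask
    f0 = rng.integers(0, n_mels - f + 1)
    spec[f0:f0 + f, :] = 0.0                     # frequency masking
    t = rng.integers(0, max_time_mask + 1)       # width of the time mask
    t0 = rng.integers(0, n_frames - t + 1)
    spec[:, t0:t0 + t] = 0.0                     # time masking
    return spec

spec = rng.random((40, 100))                     # fake log-mel spectrogram
aug = spec_augment(spec)
print(aug.shape)
```

Because the augmentation operates on features rather than audio, it is cheap enough to apply on the fly each time a training example is drawn.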
28. Conclusions
● ASR has a long history; deep-learning methods have recently produced strong results, but they do not yet clearly surpass HMM-based approaches
○ Most commercial products are still based on HMMs
● Data preprocessing (Signal Analysis) has a large impact on results
○ Models that learn directly from the raw waveform exist, but they do not match spectrogram-based performance
● The gap between training conditions and real deployment conditions is much larger than in vision tasks
○ Background noise
○ Very large variance within the same phoneme
■ Individual accent, gender, dialect, the speaker’s emotional state
● A blue ocean(?) precisely because it is hard
○ Pytorch-Kaldi (2019)
■ waveform → signal analysis → NN training