Automatic Speech Recognition techniques have evolved from classical methods like Hidden Markov Models to end-to-end neural network approaches. Classical methods train the acoustic and language models separately, while end-to-end methods directly optimize the transcription task. Connectionist Temporal Classification allows training RNNs directly on text without alignment, but requires strong language models. Seq2Seq models with attention aim to learn alignment between input and output sequences. SpecAugment improves performance through simple data augmentation of spectrograms during training. While end-to-end methods are improving, classical HMM approaches remain dominant in commercial products due to their robustness in noisy real-world conditions.
2. Table of Contents
1. Background
a. Automatic Speech Recognition (ASR)
2. Classical Methods
a. Hidden Markov Model (HMM)
b. HMM-GMM
c. HMM-DNN
3. End-to-End Methods
a. Connectionist Temporal Classification (CTC)
b. Seq2Seq & Attention
c. SpecAugment
4. Conclusions
4. What is Automatic Speech Recognition (ASR)?
[Diagram: Speech → Signal Analysis → ASR → Decoded Text (Transcription)]
5. Signal Analysis (Feature Extraction)
● Convert the analog signal to a digital representation (spectrogram)
○ A/D conversion - Fast Fourier Transform (FFT)
○ Make frames (~10 ms/frame) - mel-scale filter bank
● Frame vector (acoustic feature) = input to the ASR system
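The framing and mel filter bank steps above can be sketched in NumPy. This is a simplified illustration only: the frame length (25 ms), hop size (10 ms), and the crude triangular filter construction are assumptions, not the exact pipeline from the slides.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames (~10 ms hop)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

def log_mel_features(signal, sample_rate=16000, n_mels=40):
    """FFT magnitude per frame, then a toy mel-scale filter bank."""
    frames = frame_signal(signal, sample_rate)
    window = np.hanning(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))  # magnitude spectrum
    n_bins = spectrum.shape[1]
    # Toy triangular filters, mel-spaced (illustrative only).
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sample_rate / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_bins - 1) * hz_pts / (sample_rate / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    return np.log(spectrum @ fbank.T + 1e-8)  # one feature vector per frame

one_sec = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # fake 1 s waveform
feats = log_mel_features(one_sec)
print(feats.shape)  # (n_frames, 40)
```

Real systems typically use a library such as librosa or Kaldi for this step; the point here is only that each ~10 ms frame becomes one acoustic feature vector.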
6. Phoneme, Word and Sentence
● Phoneme
○ The unit of sound that distinguishes one word from another in a particular language.
● Word
○ Lexicon: Phoneme + Phoneme + … = Word
● Sentence
○ Word + Word + … = Sentence
○ Language Model := P(Sentence)
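A minimal sketch of the "Language Model := P(Sentence)" idea, using a bigram model over a tiny made-up corpus (the corpus and boundary markers are assumptions for illustration; real LMs use smoothing and far more data):

```python
from collections import Counter

# Hypothetical three-sentence training corpus.
corpus = [["i", "like", "speech"], ["i", "like", "sound"], ["you", "like", "speech"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]          # sentence boundary markers
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def p_sentence(words):
    """P(Sentence) ≈ product of bigram probabilities P(w_i | w_{i-1})."""
    toks = ["<s>"] + words + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        p *= bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
    return p

print(p_sentence(["i", "like", "speech"]))  # 4/9: a likely sentence in this corpus
```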
8. Pipeline of Classical ASR
[Diagram: Speech → Signal Analysis → Search Space → Decoded Text (Transcription); the Search Space combines the Acoustic Model (AM), Pronunciation Lexicon (PL), and Language Model (LM), each built from Training Data]
9. Fundamental Equation of Statistical Speech Recognition
● If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by
○ W* = argmax_W P(W|X)
● Applying Bayes’ Theorem:
○ W* = argmax_W P(X|W) P(W) / P(X) = argmax_W P(X|W) P(W)
○ P(X|W) is the acoustic model (computed with an HMM); P(W) is the language model
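The argmax above can be made concrete with a toy two-hypothesis decoder. All probabilities below are invented for illustration; the point is that the language model P(W) can overrule an acoustically stronger hypothesis.

```python
# Hypothetical scores for two competing transcriptions of the same audio.
acoustic = {"recognize speech": 0.4, "wreck a nice beach": 0.6}      # P(X|W)
language = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}  # P(W)

# W* = argmax_W P(X|W) P(W)  (P(X) is constant over W, so it drops out)
w_star = max(acoustic, key=lambda w: acoustic[w] * language[w])
print(w_star)  # "recognize speech": the LM rescues the acoustically weaker hypothesis
```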
10. Recall: Hidden Markov Model (HMM)
● Once an HMM is trained,
○ we can compute the likelihood of a given observation sequence.
○ we can find the most probable hidden state sequence for a given observation sequence (decoding).
● Assume the phoneme sequence is modelled by an HMM
○ The probability of a state depends only on the previous state
○ An output observation depends only on the state that produced it
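The first capability above, computing the likelihood of an observation sequence, is the forward algorithm. A minimal sketch with a two-state HMM and discrete observations; every number is invented for illustration:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities A[i, j]
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities B[state, symbol]
pi = np.array([0.5, 0.5])                # initial state distribution

def likelihood(obs):
    """Forward algorithm: P(observation sequence | HMM)."""
    alpha = pi * B[:, obs[0]]            # initialise with first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate and emit
    return alpha.sum()                   # sum over final states

print(round(likelihood([0, 1, 0]), 6))  # 0.099375
```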
11. Calculation of P(X|W)
● Assume W corresponds to the phoneme /s/; the conditional probability that we observe the acoustic sequence X is P(X | /s/)
● An HMM is employed to calculate it.
12. Calculation of P(X|W)
● Let’s consider a simple case where the length of the input sequence is just one (T = 1), X = (x1), and the dimensionality of x is one (d = 1)
● A Gaussian distribution function could be employed for this: P(x | /s/) = N(x; μ, σ²)
● Given a set of training samples, we can estimate μ and σ²
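Estimating μ and σ² for the single-frame, one-dimensional case is just the maximum-likelihood mean and variance. The sample values below are hypothetical stand-ins for frames of /s/:

```python
import numpy as np

# Hypothetical 1-D acoustic feature values observed for phoneme /s/.
samples = np.array([1.8, 2.1, 2.0, 1.9, 2.2])

mu = samples.mean()        # MLE of the mean
sigma2 = samples.var()     # MLE of the variance (divides by N)

def p_x_given_s(x):
    """Gaussian output density N(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(mu, sigma2)  # 2.0 0.02
```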
13. Acoustic Model: Continuous Density HMM
● For the general case where a phone lasts more than one frame, we need to employ an HMM
● We need to define the output distribution of each state
14. Acoustic Model: Continuous Density HMM
● Output distribution: M-component Gaussian Mixture Model (GMM)
○ Individual components take responsibility for parts of the data set
○ Parameters are all estimated from data by EM
● Then, train the HMM with the Baum-Welch algorithm (a kind of EM)
○ Parameters λ: transition probabilities & Gaussian parameters for each state j
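The EM idea behind GMM estimation can be sketched for a 1-D, two-component mixture. The synthetic data and initial values are assumptions; a real system fits one GMM per HMM state over high-dimensional frames.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two clusters (stand-in for frame features).
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 200)])

# Initial guesses for a 2-component GMM.
w = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):                              # EM iterations
    # E-step: responsibility of each component for each sample.
    r = w * gauss(data[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances from responsibilities.
    n = r.sum(axis=0)
    w = n / len(data)
    mu = (r * data[:, None]).sum(axis=0) / n
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / n

print(np.sort(mu))  # component means end up near the true cluster centres
```

This is exactly the "individual components take responsibility for parts of the data set" picture: the E-step computes those responsibilities, the M-step refits each component to its share.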
15. Decoding
● Given an observation sequence and an HMM, determine the most probable hidden
state sequence (Viterbi algorithm)
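A minimal Viterbi sketch for a toy two-state HMM with discrete observations (all numbers invented for illustration):

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities B[state, symbol]
pi = np.array([0.5, 0.5])                # initial state distribution

def viterbi(obs):
    """Most probable hidden state sequence for an observation sequence."""
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * A          # scores[i, j]: best path into j via i
        back.append(scores.argmax(axis=0))   # remember the best predecessor
        delta = scores.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]
    for ptr in reversed(back):               # backtrack from the best final state
        path.append(int(ptr[path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1]))  # [0, 0, 1, 1]
```

The max/argmax pair is the only difference from the forward algorithm, which sums over paths instead of taking the best one.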
18. Limitation
● AM and LM are trained separately, each with a different objective
● Too many steps
[Diagram: the classical pipeline again - Speech → Signal Analysis → Search Space (Acoustic Model (AM), Pronunciation Lexicon (PL), Language Model (LM), each built from Training Data) → Decoded Text (Transcription)]
20. Connectionist Temporal Classification (CTC)
● The spectrograms are processed by an RNN with a CTC output layer.
○ A NN trained as a frame-level classifier would require an alignment between the audio and transcription sequences; CTC removes this requirement.
● The network is trained directly on the text transcripts:
○ no phonetic representation
○ no pronunciation lexicon
● Directly optimises the word error rate
21. Connectionist Temporal Classification (CTC)
● An objective function that allows an RNN to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences.
● The output layer contains a single unit for each of the transcription labels, plus an extra unit referred to as the ‘blank’
○ e.g. the alignment (a, a, a, -, b) collapses to (a, b)
22. Connectionist Temporal Classification (CTC)
● Given a length-T input sequence x, the output vectors y_t are normalised with the softmax
○ Pr(k, t | x) = exp(y_t^k) / Σ_k' exp(y_t^k'), where y_t^k is the k-th element of y_t
● A CTC alignment a is a length-T sequence of blank and label indices; its probability is
○ Pr(a | x) = Π_t Pr(a_t, t | x)
● ‘Integrating out’ over possible alignments gives the transcription probability
○ Pr(y | x) = Σ_{a : B(a) = y} Pr(a | x), where B collapses repeats and removes blanks: y = B(a)
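The collapsing function B above is simple enough to write out directly: merge repeated labels, then drop blanks. A sketch (the blank symbol "-" follows the slide's notation):

```python
from itertools import groupby

BLANK = "-"

def ctc_collapse(alignment):
    """B(a): merge repeated labels, then remove blanks."""
    merged = [k for k, _ in groupby(alignment)]   # collapse consecutive repeats
    return [k for k in merged if k != BLANK]      # drop the blank symbol

print(ctc_collapse(["a", "a", "a", "-", "b"]))  # ['a', 'b']
print(ctc_collapse(["a", "-", "a", "b", "b"]))  # ['a', 'a', 'b']
```

Note the ordering matters: blanks are what allow genuine doubled labels ("aa") to survive, since a blank between two a's blocks the repeat-merge.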
23. Connectionist Temporal Classification (CTC)
● Limitation
○ CTC assumes independence between acoustic frames.
○ But natural language has dependencies between previous phonemes (words) and current phonemes (words)
■ Context
○ It therefore needs a strong language model (LM)
24. Recall: Seq2Seq & Attention
● Seq2Seq
○ Encoder: input sequence → context vector
○ Decoder: context vector → output sequence
● Attention
○ Different parts of an input have different levels of significance.
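A minimal dot-product attention sketch over toy encoder states. The vectors below are invented so that the query clearly "matches" one input position; real models use learned projections and higher dimensions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: score each input position against the query."""
    scores = keys @ query                  # one score per input position
    weights = softmax(scores)              # attention distribution over inputs
    return weights @ values, weights       # context vector + alignment weights

# Hypothetical encoder states: 4 input positions, dimension 3.
keys = values = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 1.0, 0]])
query = np.array([0.0, 0.0, 5.0])          # decoder state matching position 2

context, weights = attend(query, keys, values)
print(int(weights.argmax()))  # 2: attention peaks at the aligned input position
```

The `weights` vector is exactly the soft alignment the next slide refers to: at each decoding step it says which input frames matter for the current output token.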
25. Attention can be thought of as alignment
● The attention scores at decoding time step i indicate which acoustic features align with the i-th output token (Listen, Attend and Spell)
26. SpecAugment (SOTA 2019)
● A data augmentation method for spectrograms
○ Time warping
○ Frequency masking
○ Time masking
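The two masking operations can be sketched directly on a spectrogram array (time warping is omitted here; the mask-width limits are assumptions, not the paper's exact policy):

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, max_freq_mask=8, max_time_mask=10):
    """Zero out one random frequency band and one random time band."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = rng.integers(0, max_freq_mask + 1)       # width of the frequency mask
    f0 = rng.integers(0, n_mels - f + 1)
    spec[f0:f0 + f, :] = 0.0                     # frequency masking
    t = rng.integers(0, max_time_mask + 1)       # width of the time mask
    t0 = rng.integers(0, n_frames - t + 1)
    spec[:, t0:t0 + t] = 0.0                     # time masking
    return spec

spec = rng.random((40, 100))                     # fake log-mel spectrogram
aug = spec_augment(spec)
print(aug.shape)
```

Because the augmentation operates on features rather than audio, it is cheap enough to apply on the fly each time a training example is drawn.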
28. Conclusions
● ASR has a long history; deep-learning methods have recently produced strong results, but they do not yet clearly surpass HMM-based approaches
○ Most commercial products are still based on HMMs
● Data preprocessing (Signal Analysis) has a large impact on results
○ Models that learn directly from the raw waveform exist, but they do not match spectrogram-based performance
● The gap between training conditions and real deployment conditions is much larger than in vision tasks
○ Background noise
○ Very large variance within the same phoneme
■ Individual accent, gender, dialect, the speaker’s emotional state
● A blue ocean(?) precisely because it is hard
○ Pytorch-Kaldi (2019)
■ waveform → signal analysis → NN training